With the emergence of big data, new jobs have appeared, demanding new sets of skills and expertise for extracting value from data (Axis Talent, 2020; ai-jobs.net, 2023):
Which one is the most valued in the Mexican labor market currently?
To identify which data job category has the highest salary in the Mexican labor market in November 2023 according to the OCC website.
Which data job category has the highest salary in the Mexican labor market in November 2023 according to the OCC website?
The Data Architect position has the highest salary in the Mexican labor market in November 2023 according to the OCC website.
The methodology of the present study is based on Rollin’s Foundational Methodology for Data Science (Rollins, 2015).
Furthermore:
The full project can be found at the GitHub repository.
In this context, the purpose of the present notebook is to perform data exploration, data preparation, data analysis, and data visualization, as well as the statistical analysis.
# Loading Requirements Text File
!pip install -r requirements.txt
# Libraries installation
# !pip install seaborn
# !pip install folium
# !pip install plotly
# !pip install scipy
# !pip install statsmodels
# !pip install -U kaleido
# !pip install nbformat
# !pip install scikit_posthocs
# Libraries importation
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
from matplotlib_inline.backend_inline import set_matplotlib_formats
import seaborn as sns
import folium
from folium.plugins import MarkerCluster
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.io as pio
import scipy.stats as stats
import statsmodels.api as sm
import statsmodels.formula.api as smf
from scikit_posthocs import posthoc_tukey, posthoc_dunn
from sklearn.utils import resample
import re
import os
# Matplotlib settings
# Setting theme and plot resolution
sns.set_theme(context = 'notebook', style = 'darkgrid')
mpl.rcParams["figure.dpi"] = 100
mpl.rcParams["savefig.dpi"] = 300
set_matplotlib_formats('svg')
# Setting default plot's aesthetics
plotfontcolor = 'dimgray'
mpl.rcParams['text.color'] = plotfontcolor
mpl.rcParams['axes.labelcolor'] = plotfontcolor
mpl.rcParams['xtick.color'] = plotfontcolor
mpl.rcParams['ytick.color'] = plotfontcolor
mpl.rcParams["font.size"] = 10
mpl.rcParams['axes.titlesize'] = 14
mpl.rcParams['axes.labelsize'] = 12
plt.rcParams['xtick.labelsize'] = 10
plt.rcParams['ytick.labelsize'] = 10
plt.rcParams['legend.fontsize'] = 11
mpl.rcParams["axes.labelweight"] = "bold"
mpl.rcParams["axes.titleweight"] = "bold"
#mpl.rcParams['font.family'] = 'sans-serif'
#mpl.rcParams['font.family'] = 'serif'
# Plotly settings
# Setting Notebook's output rendering
pio.renderers.default = "svg"
# Configuration for exporting Plotly charts for publication
config = {
'toImageButtonOptions': {
'format': 'png', # one of png, svg, jpeg, webp
'filename': 'Plot',
'scale': 2 # Multiply title/legend/axis/canvas sizes by this factor
}
}
# Setting Font Properties for Plotly
font_px = dict(size=14, color ='dimgray')
# Setting Legend aesthetics for Plotly
legend_px = dict(bgcolor='#f0f0f0',
bordercolor='#cbcccd',
borderwidth=1.5)
# Setting Heatmap Color Scale for Plotly
heatmap_px_colorscale = [(0, "#000000"),
(0.03, px.colors.sequential.Blues[8]),
(0.25, px.colors.sequential.Blues[6]),
(0.5, px.colors.sequential.Blues[4]),
(0.75, px.colors.sequential.Blues[2]),
(1, "aliceblue")]
heatmap_px_categoryarray = ['BI Analyst',
'Business Analyst',
'Data Analyst',
'Data Architect',
'Data Engineer',
'Data Scientist',
'ML Engineer']
As shown in the previous notebook, and in accordance with the defined data requirements (job name, salary, company, and location), the data was collected by performing a web scraping process on the OCC website (Mexico) on 26 November 2023. Python 3 and its libraries Selenium and BeautifulSoup were used in this process.
To do so, keywords were used (in both English and Spanish) to encompass the above-mentioned data jobs:
jobs_list = ["analista datos",
"data analyst",
"cientifico datos",
"data scientist",
"ingeniero datos",
"data engineer",
"arquitecto datos",
"data architect",
"analista negocio",
"business analyst",
"bi analyst",
"business intelligence",
"aprendizaje automatico",
"machine learning"]
Furthermore, the most important challenge faced during the data collection step was the dynamic class ids in the OCC website. Indeed, it was found that the class ids for the salary, location, and company tags changed every few minutes. Thus, regular expressions were used to scrape as much data as possible from the HTML. Please refer to the notebook for all the details.
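As an illustration of this approach, here is a minimal sketch (with a made-up HTML snippet and class id; the actual OCC markup differs) of anchoring a regular expression on the salary text itself rather than on the volatile class id:

```python
import re

# Hypothetical snippet: class ids like "sc-a1b2c3" change every few minutes,
# so the pattern targets the salary text instead of the class id.
html = '<p class="sc-a1b2c3">$20,000 - $25,000 Mensual</p>'

# Matches "$N" or "$N - $M" followed by the payment periodicity "Mensual"
salary_pattern = re.compile(r'\$[\d,]+(?:\s*-\s*\$[\d,]+)?\s*Mensual')
match = salary_pattern.search(html)
salary = match.group(0) if match else None
print(salary)  # $20,000 - $25,000 Mensual
```

Because the pattern keys on the stable salary text, it keeps working even when the surrounding class ids change.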
The collected dataset was stored in the GitHub repository of this project.
# Data importation
df = pd.read_csv('https://raw.githubusercontent.com/DanielEduardoLopez/DataJobs-Nov2023-MX/main/Dataset_raw.csv')
The data collected from the web scraping process was loaded into a Pandas dataframe. Then, the dataset was explored with the Pandas and Seaborn libraries to gain some understanding and preliminary insights.
# Preview of the dataset
df.head(10)
| Job | Salary | Company | Location | |
|---|---|---|---|---|
| 0 | Analista de datos | $20,000 - $25,000 Mensual | SKANDIA OPERADORA DE F... | Polanco II Sección, Miguel Hidalgo, CDMX |
| 1 | Analista de Datos | Sueldo no mostrado por la empresa | GENERAL DE SEGUROS | CDMX |
| 2 | Analista de datos MDM | $17,000 - $20,000 Mensual | Grupo Alnus S.A. de C.V. | Monterrey, N.L. |
| 3 | ANALISTA DE DATOS ECOMMERCE | $16,000 Mensual | Grupo Daisa | Cuauhtémoc, CDMX |
| 4 | Analista de datos maestros | $10,000 Mensual | Empresa confidencial | Monterrey, N.L. |
| 5 | Coordinador analista de datos | Sueldo no mostrado por la empresa | Empresa confidencial | Miguel Hidalgo, CDMX |
| 6 | ANALISTA DE DATOS / EXPERTO EN EXCEL | $11,000 - $13,000 Mensual | Empresa confidencial | Zapopan, Jal. |
| 7 | Analista de Datos en Azure (6 meses) | $55,000 - $60,000 Mensual | SYGNO | Monterrey, N.L. |
| 8 | Analista de Datos (Power Bi-Tableau) | $29,000 Mensual | Empresa confidencial | CDMX |
| 9 | Analista de base de datos | $25,000 - $27,000 Mensual | SERVICIOS DE ENERGIA E... | Monterrey, N.L. |
df.tail(10)
| Job | Salary | Company | Location | |
|---|---|---|---|---|
| 3812 | Trabajo Desde Casa Desarrollador .NET Backend ... | Sueldo no mostrado por la empresa | BAIRESDEV LLC / (Tax I... | Guadalajara, Jal. |
| 3813 | Work From Home Junior QA Automation / R+D/ Ref... | Sueldo no mostrado por la empresa | BAIRESDEV LLC / (Tax I... | Guadalajara, Jal. |
| 3814 | Work From Home .NET + Angular Architect / R + ... | Sueldo no mostrado por la empresa | BAIRESDEV LLC / (Tax I... | Guadalajara, Jal. |
| 3815 | Work From Home SemiSenior Angular Developer / ... | Sueldo no mostrado por la empresa | BAIRESDEV LLC / (Tax I... | Guadalajara, Jal. |
| 3816 | Work From Home QA Automation Developer / R + D... | Sueldo no mostrado por la empresa | BAIRESDEV LLC / (Tax I... | Guadalajara, Jal. |
| 3817 | Work From Home Python Tech Lead / Research + D... | Sueldo no mostrado por la empresa | BAIRESDEV LLC / (Tax I... | Guadalajara, Jal. |
| 3818 | Trabajo Desde Casa Desarrollador Junior/MidLev... | Sueldo no mostrado por la empresa | BAIRESDEV LLC / (Tax I... | Guadalajara, Jal. |
| 3819 | Work From Home Angular Architect / R+D / Ref. ... | Sueldo no mostrado por la empresa | BAIRESDEV LLC / (Tax I... | Guadalajara, Jal. |
| 3820 | Work From Home Angular Tech Lead / R+D / Ref. ... | Sueldo no mostrado por la empresa | BAIRESDEV LLC / (Tax I... | Guadalajara, Jal. |
| 3821 | Work From Home Senior Angular Developer / R+D ... | Sueldo no mostrado por la empresa | BAIRESDEV LLC / (Tax I... | Guadalajara, Jal. |
df.columns
Index(['Job', 'Salary', 'Company', 'Location'], dtype='object')
In accordance with the set data requirements, the dataset contains four columns: Job, Salary, Company, and Location.
It can be observed that the data lacks consistency in every variable: some observations are in uppercase, salary values are strings, location descriptions comprise the state and sometimes also include the city, etc.
Furthermore, it is also clear that some jobs do not correspond to any of the previously defined data jobs, so the dataset must be cleaned appropriately.
Nonetheless, some further exploration will be carried out to develop a more complete strategy for doing so.
# Dataset size
df.shape
(3822, 4)
So, we have 3822 original observations in total.
# Basic dataset info
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3822 entries, 0 to 3821
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   Job       3822 non-null   object
 1   Salary    3822 non-null   object
 2   Company   3822 non-null   object
 3   Location  3606 non-null   object
dtypes: object(4)
memory usage: 119.6+ KB
Firstly, all of the columns are strings. This is expected for the Job, Company, and Location variables, but not for Salary, which must be appropriately wrangled.
From the results above, it can also be seen that the variables Job, Salary, and Company do not exhibit null values, whereas Location does. Indeed, in a few cases, the vacancy did not disclose the working location.
# Basic dataset description
df.describe()
| Job | Salary | Company | Location | |
|---|---|---|---|---|
| count | 3822 | 3822 | 3822 | 3606 |
| unique | 924 | 249 | 419 | 145 |
| top | Data Scientist | Sueldo no mostrado por la empresa | Empresa confidencial | CDMX |
| freq | 70 | 2346 | 534 | 1076 |
From the results above, it is noteworthy that Data Scientist is the most common job in the dataset; whereas most of the vacancies correspond to confidential salaries and companies.
In addition to that, there were 924 different job titles, which should be mapped to the 7 data jobs defined above.
In the present section, the dataset was assessed to identify the existence of missing values.
# Function to calculate the percentage of missing values for each column in the dataset
def missing_values_table(df):
    """
    Calculates the number of missing values and their corresponding percentage
    of total values for each attribute in a pandas dataframe.

    Parameters:
        df: Dataset in the format of a Pandas dataframe

    Returns:
        mis_val_table: Table with the missing values percentage for each attribute
    """
    mis_val = df.isnull().sum()
    mis_val_percent = (mis_val / len(df)) * 100
    mis_val_table = pd.DataFrame({'Attribute': mis_val.index,
                                  'Missing Values': mis_val.values,
                                  '% of Total Values': mis_val_percent.values})
    mis_val_table = (mis_val_table[mis_val_table['Missing Values'] != 0]
                     .sort_values('% of Total Values', ascending=False)
                     .round(1))
    print("Dataset has " + str(df.shape[1]) + " columns.\n"
          "There are " + str(mis_val_table.shape[0]) + " attributes that have missing values.")
    return mis_val_table
missing_values_table(df)
Dataset has 4 columns. There are 1 attributes that have missing values.
| Attribute | Missing Values | % of Total Values | |
|---|---|---|---|
| 3 | Location | 216 | 5.7 |
Thus, about 5.7% of the observations in the location variable are null.
In view of the low percentage of missing values, it would be possible to safely remove the vacancies that do not state a working location. However, these vacancies might correspond to remote positions, so it is preferable to impute the missing values with "Remote/NA".
In this section, it was assessed whether the dataset covers all the required cases, which, in this case, means having observations for each data job.
# Job list definition
jobs_list = ["analista de datos",
"data analyst",
"cientifico de datos",
"data scientist",
"ingeniero de datos",
"data engineer",
"arquitecto de datos",
"data architect",
"analista de negocio",
"business analyst",
"bi analyst",
"business intelligence",
"aprendizaje automatico",
"machine learning"]
# Assessment of completeness
completeness = pd.DataFrame({'Job': 0, 'True Occurrences': 0, 'False Occurrences': 0}, index=[0])
for job in jobs_list:
    bool_series = df['Job'].str.lower().isin([job]).value_counts()
    true_ocur = bool_series.get(True, 0)
    false_ocur = bool_series.get(False, 0)
    new_row = pd.DataFrame({'Job': job, 'True Occurrences': true_ocur, 'False Occurrences': false_ocur}, index=[0])
    completeness = pd.concat([completeness, new_row]).reset_index(drop=True)
completeness = completeness.iloc[1:]
completeness
completeness
| Job | True Occurrences | False Occurrences | |
|---|---|---|---|
| 1 | analista de datos | 120 | 3702 |
| 2 | data analyst | 48 | 3774 |
| 3 | cientifico de datos | 8 | 3814 |
| 4 | data scientist | 78 | 3744 |
| 5 | ingeniero de datos | 66 | 3756 |
| 6 | data engineer | 88 | 3734 |
| 7 | arquitecto de datos | 0 | 3822 |
| 8 | data architect | 22 | 3800 |
| 9 | analista de negocio | 0 | 3822 |
| 10 | business analyst | 54 | 3768 |
| 11 | bi analyst | 10 | 3812 |
| 12 | business intelligence | 12 | 3810 |
| 13 | aprendizaje automatico | 0 | 3822 |
| 14 | machine learning | 0 | 3822 |
So, taking into account the job titles in both English and Spanish, there are observations for almost all data jobs, except for the Machine Learning Engineer position. This is not the ideal scenario, but it is enough to proceed with the present analysis. It should be noted that the assessment above counts only exact (case-insensitive) title matches, so job titles that merely contain these terms are not counted.
In this section, the data types for each attribute and observation were assessed to identify incorrect data types.
# Setting columns by data type
string_cols = ['Job','Company', 'Location']
numerical_cols = ['Salary']
# Regex pattern to identify a number
number_pattern = r"^\d+\.?\d*$"
# Creating empty lists
numbers_list = []
incorrect_numeric_list = []
incorrect_string_list = []
# Detecting incorrect data types in expected string columns
for col in string_cols:
    for i, observation in enumerate(df[col].values):
        if re.findall(number_pattern, str(observation)):
            incorrect_string_list.append({"Attribute": col, "Index": i, "Incorrect Datatype Observation": observation, 'Expected Datatype': 'string'})
# Detecting incorrect data types in expected numerical columns
for col in numerical_cols:
    for i, observation in enumerate(df[col].values):
        if re.findall(number_pattern, str(observation)):
            numbers_list.append(observation)
        else:
            incorrect_numeric_list.append({"Attribute": col, "Index": i, "Incorrect Datatype Observation": observation, 'Expected Datatype': 'float'})
# Displaying results as a dataframe
incorrect_data_types = pd.DataFrame(data = (incorrect_string_list + incorrect_numeric_list))
incorrect_data_types
| Attribute | Index | Incorrect Datatype Observation | Expected Datatype | |
|---|---|---|---|---|
| 0 | Salary | 0 | $20,000 - $25,000 Mensual | float |
| 1 | Salary | 1 | Sueldo no mostrado por la empresa | float |
| 2 | Salary | 2 | $17,000 - $20,000 Mensual | float |
| 3 | Salary | 3 | $16,000 Mensual | float |
| 4 | Salary | 4 | $10,000 Mensual | float |
| ... | ... | ... | ... | ... |
| 3817 | Salary | 3817 | Sueldo no mostrado por la empresa | float |
| 3818 | Salary | 3818 | Sueldo no mostrado por la empresa | float |
| 3819 | Salary | 3819 | Sueldo no mostrado por la empresa | float |
| 3820 | Salary | 3820 | Sueldo no mostrado por la empresa | float |
| 3821 | Salary | 3821 | Sueldo no mostrado por la empresa | float |
3822 rows × 4 columns
As noted above, the Salary attribute exhibited an incorrect string data type for all the observations. Thus, the non-numeric characters should be removed and the ranges should be transformed into an average.
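Such a transformation can be sketched as follows (a hypothetical, stdlib-only helper for illustration; `parse_salary` is not part of the notebook's actual wrangling code): strip the non-numeric characters and take the midpoint of a range.

```python
import re

def parse_salary(salary):
    """Return the numeric salary in MXN (midpoint of a range), or None if undisclosed."""
    # Grab digit groups such as "20,000", drop the thousands separators
    amounts = [float(a.replace(',', '')) for a in re.findall(r'[\d,]+(?:\.\d+)?', salary)]
    return sum(amounts) / len(amounts) if amounts else None

print(parse_salary('$20,000 - $25,000 Mensual'))          # 22500.0
print(parse_salary('$16,000 Mensual'))                    # 16000.0
print(parse_salary('Sueldo no mostrado por la empresa'))  # None
```

Averaging the two endpoints of a range is one simple convention; the undisclosed-salary sentinel maps to None so those rows can be filtered out later.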
In this section, the data was explored.
# Value counts for each job position
df[['Job']].value_counts()
Job
Data Scientist 70
Analista de Datos 62
Data engineer 58
Ingeniero de datos 46
Analista de datos 40
..
CARRIER ANALYST 2
CATMAN - BI 2
CHOFER MENSAJERO 2
CHOFER REFACCIONES 2
Data Quality Sr. Analyst 2
Name: count, Length: 924, dtype: int64
According to the result above, Data Scientist is the most demanded data job. However, it is noteworthy that Analista de datos (Spanish for Data Analyst) appears several times in the list with different combinations of upper and lower case. So, before visually exploring the dataset, it is a good idea to convert all the observations to lower case.
Moreover, 924 different position titles are present in the dataset, while only 7 are required!
# Visual exploration of the Job variable (top 20)
fig, ax = plt.subplots(figsize = (7, 7))
sns.barplot(x = df['Job'].str.lower().value_counts().values[0:20],
y = df['Job'].str.lower().value_counts().index[0:20],
hue = df['Job'].str.lower().value_counts().index[0:20],
palette="Blues_r",
legend = False,
alpha = 0.9)
plt.xlabel('Frequency')
plt.ylabel('Job title')
plt.title('Most Common Data Jobs in Dataset')
plt.show()
It appears that Analista de datos (Spanish for Data Analyst) is the most demanded position, followed by Data engineer and Data scientist.
Notwithstanding the above, it is necessary to harmonize the job titles written in English and in Spanish, as well as the different variations of the same job title.
On the other hand, other job titles appearing in the dataset are errors or do not correspond to data jobs.
# Visual exploration of the Job variable
fig, ax = plt.subplots(figsize = (10, 7))
sns.scatterplot(x = df['Job'].str.lower().value_counts().index,
y = df['Job'].str.lower().value_counts().values,
color="Green", alpha = 0.9)
plt.xticks([])
plt.xlabel('Job Title')
plt.ylabel('Frequency')
plt.title('Job Titles Frequency in Dataset')
plt.show()
The plot above shows that most of the job titles retrieved from the web scraping are unique values. This suggests that the job title variable is fairly inconsistent and/or contains many other positions that are not of interest for this study.
# Value counts for each location
df[['Location']].value_counts()
Location
CDMX 1076
Miguel Hidalgo, CDMX 406
Guadalajara, Jal. 392
Monterrey, N.L. 238
Cuauhtémoc, CDMX 72
...
Lomas Altas, Miguel Hidalgo, CDMX 2
Lerma, Edo. Méx. 2
La Paz, BCS. 2
La Magdalena Contreras, CDMX 2
Mexicali, BC. 2
Name: count, Length: 145, dtype: int64
# Visual exploration of the Location variable (top 20)
fig, ax = plt.subplots(figsize = (7, 7))
sns.barplot(x = df['Location'].value_counts().values[0:20],
y = df['Location'].value_counts().index[0:20],
hue = df['Location'].value_counts().index[0:20],
legend = False,
palette="Blues_r",
alpha = 0.9)
plt.xlabel('Frequency')
plt.ylabel('Location')
plt.title('Most Common Locations in Dataset')
plt.show()
As shown in the plot above, in some cases only the name of the state appears, while in others the name of the municipality is also included. So, further cleaning is necessary to map all the locations to their corresponding Mexican states.
Notwithstanding the above, it is clear that CDMX (Mexico City) is the location where data jobs are most demanded.
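Such a mapping could be sketched as follows (a partial, hypothetical dictionary built from the abbreviations visible above; the actual cleaning would need to cover all 32 Mexican states):

```python
# Partial, hypothetical mapping from OCC location abbreviations to state names
state_map = {
    'CDMX': 'Ciudad de México',
    'Jal.': 'Jalisco',
    'N.L.': 'Nuevo León',
    'Son.': 'Sonora',
    'BC.': 'Baja California',
}

# Unknown abbreviations fall back to a sentinel instead of raising a KeyError
print(state_map.get('Jal.', 'Unknown'))  # Jalisco
print(state_map.get('???', 'Unknown'))   # Unknown
```

Using `dict.get` with a default keeps the cleaning robust against abbreviations not yet added to the dictionary.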
# Value counts for each company
df[['Company']].value_counts()
Company
Empresa confidencial 534
BAIRESDEV LLC / (Tax I... 524
Banamex 204
Adecco 60
Manpower, S.A. de C.V. 56
...
ETAM SAPI DE CV 2
Raul Barragán de la Cruz 2
Red Acero 2
Ricoh Mexicana S.A. de... 2
DATALOGIC SA de CV 2
Name: count, Length: 419, dtype: int64
# Visual exploration of the Company variable (top 20)
fig, ax = plt.subplots(figsize = (7, 7))
sns.barplot(x = df['Company'].value_counts().values[0:20],
y = df['Company'].value_counts().index[0:20],
hue = df['Company'].value_counts().index[0:20],
legend = False,
palette="Blues_r",
alpha = 0.9)
plt.xlabel('Frequency')
plt.ylabel('Company')
plt.title('Most Common Companies in Dataset')
plt.show()
In the dataset, most of the vacancies correspond to confidential companies. However, among those where the hiring company is disclosed, BAIRESDEV is the company with the highest number of data job vacancies, followed by Banamex, Adecco, and Manpower.
# Visual exploration of the Company variable
fig, ax = plt.subplots(figsize = (10, 7))
sns.scatterplot(x = df['Company'].value_counts().index,
y = df['Company'].value_counts().values,
hue = df['Company'].value_counts().values,
legend = False,
alpha = 0.9)
plt.xticks([])
plt.xlabel('Company')
plt.ylabel('Frequency')
plt.title('Companies Frequency in Dataset')
plt.show()
The plot above shows that most companies only offer one or two data-related positions.
# Value counts for each Salary
df['Salary'].value_counts()
Salary
Sueldo no mostrado por la empresa 2346
$60,000 - $70,000 Mensual 38
$25,000 - $30,000 Mensual 36
$30,000 - $35,000 Mensual 30
$20,000 - $25,000 Mensual 28
...
$11,000 - $11,300 Mensual 2
$10,400 Mensual 2
$6,600 Mensual 2
$6,999 - $7,000 Mensual 2
$18,000 - $25,000 Mensual 2
Name: count, Length: 249, dtype: int64
# Visual exploration of the Salary variable (top 20)
fig, ax = plt.subplots(figsize = (7, 7))
sns.barplot(x = df['Salary'].value_counts().values[0:20],
y = df['Salary'].value_counts().index[0:20],
hue = df['Salary'].value_counts().index[0:20],
legend = False,
palette ="Blues_r",
alpha = 0.9)
plt.xlabel('Frequency')
plt.ylabel('Salary')
plt.title('Salaries in Dataset')
plt.show()
Obviously, it is necessary to clean the data in order to get more meaningful insights regarding the salary. However, it is noteworthy that most of the published vacancies have an undisclosed salary ("Sueldo no mostrado por la empresa").
For the vacancies with a disclosed salary, a salary range is more commonly published than a single value. In this sense, it appears that a range from 60,000 MXN to 70,000 MXN is the most common, although this figure should be taken with caution, as the salary variable is still a string rather than an actual number.
Of course, the periodicity of the payment ("Mensual": monthly) must be removed from all observations.
After the data was explored, some wrangling strategies to effectively clean the data were defined. In this sense, some of the procedures applied to the data were:
From the raw data obtained through web scraping, it was noticed that many of the published vacancies lack salary data. However, those positions without salary data might still be useful for assessing the geographical distribution of data jobs in Mexico, as well as for identifying which companies demand more data specialists even if they have not published a proposed salary.
Thus, two datasets were prepared:
As some companies republish their vacancies on the website so that they appear in the first places of the search results, it was deemed necessary to drop the duplicate positions.
# Drop of duplicates
df = df.drop_duplicates()
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 1182 entries, 0 to 3541
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   Job       1182 non-null   object
 1   Salary    1182 non-null   object
 2   Company   1182 non-null   object
 3   Location  1117 non-null   object
dtypes: object(4)
memory usage: 46.2+ KB
# Resetting index
df = df.reset_index().drop(columns='index')
df.head(20)
| Job | Salary | Company | Location | |
|---|---|---|---|---|
| 0 | Analista de datos | $20,000 - $25,000 Mensual | SKANDIA OPERADORA DE F... | Polanco II Sección, Miguel Hidalgo, CDMX |
| 1 | Analista de Datos | Sueldo no mostrado por la empresa | GENERAL DE SEGUROS | CDMX |
| 2 | Analista de datos MDM | $17,000 - $20,000 Mensual | Grupo Alnus S.A. de C.V. | Monterrey, N.L. |
| 3 | ANALISTA DE DATOS ECOMMERCE | $16,000 Mensual | Grupo Daisa | Cuauhtémoc, CDMX |
| 4 | Analista de datos maestros | $10,000 Mensual | Empresa confidencial | Monterrey, N.L. |
| 5 | Coordinador analista de datos | Sueldo no mostrado por la empresa | Empresa confidencial | Miguel Hidalgo, CDMX |
| 6 | ANALISTA DE DATOS / EXPERTO EN EXCEL | $11,000 - $13,000 Mensual | Empresa confidencial | Zapopan, Jal. |
| 7 | Analista de Datos en Azure (6 meses) | $55,000 - $60,000 Mensual | SYGNO | Monterrey, N.L. |
| 8 | Analista de Datos (Power Bi-Tableau) | $29,000 Mensual | Empresa confidencial | CDMX |
| 9 | Analista de base de datos | $25,000 - $27,000 Mensual | SERVICIOS DE ENERGIA E... | Monterrey, N.L. |
| 10 | Analista de Bases de Datos BI | $28,000 - $29,000 Mensual | Empresa confidencial | CDMX |
| 11 | Analista de Gestión de Datos Jr. | $11,000 - $11,500 Mensual | Pepsico División Foods | Hermosillo, Son. |
| 12 | Data Analyst | Sueldo no mostrado por la empresa | Michael Page | NaN |
| 13 | Data Analyst | $60,000 - $65,000 Mensual | Alia | San José Insurgentes, Benito Juárez, CDMX |
| 14 | Data analyst | $30,000 - $35,000 Mensual | Empresa confidencial | Miguel Hidalgo, CDMX |
| 15 | Data analyst | Sueldo no mostrado por la empresa | Carl Zeiss de México, ... | CDMX |
| 16 | Portfolio Data Analyst | Sueldo no mostrado por la empresa | Banamex | CDMX |
| 17 | Data Warehouse Sr. Consultant | Sueldo no mostrado por la empresa | Banamex | CDMX |
| 18 | ANALISTA Y SOPORTE TI | $15,000 - $20,000 Mensual | Empresa confidencial | Cuajimalpa de Morelos, CDMX |
| 19 | GESTOR DE DATOS DE PRUEBA JR | Sueldo no mostrado por la empresa | Getecsa | Miguel Hidalgo, CDMX |
This section shows the process carried out to harmonize the job titles of the data-related positions and to drop the non-data-related jobs.
To do so, a data jobs dictionary with keys in both English and Spanish was defined. If a certain vacancy comprised all the terms in a key, its job title was renamed to the corresponding standard job title.
As a backup, a copy of the original job titles was kept in another column for comparison purposes.
# Creating a new column for keeping the original job titles
df['Original Job Title'] = df['Job']
df = df[['Original Job Title', 'Job', 'Salary', 'Company', 'Location']]
df.head()
| Original Job Title | Job | Salary | Company | Location | |
|---|---|---|---|---|---|
| 0 | Analista de datos | Analista de datos | $20,000 - $25,000 Mensual | SKANDIA OPERADORA DE F... | Polanco II Sección, Miguel Hidalgo, CDMX |
| 1 | Analista de Datos | Analista de Datos | Sueldo no mostrado por la empresa | GENERAL DE SEGUROS | CDMX |
| 2 | Analista de datos MDM | Analista de datos MDM | $17,000 - $20,000 Mensual | Grupo Alnus S.A. de C.V. | Monterrey, N.L. |
| 3 | ANALISTA DE DATOS ECOMMERCE | ANALISTA DE DATOS ECOMMERCE | $16,000 Mensual | Grupo Daisa | Cuauhtémoc, CDMX |
| 4 | Analista de datos maestros | Analista de datos maestros | $10,000 Mensual | Empresa confidencial | Monterrey, N.L. |
# Creating job dictionary
job_dict = {
('data', 'analyst'): 'Data Analyst',
('analista', 'datos'): 'Data Analyst',
('data', 'scientist'): 'Data Scientist',
('data', 'science'): 'Data Scientist',
('científico', 'datos'): 'Data Scientist',
('cientifico', 'datos'): 'Data Scientist',
('data', 'engineer'): 'Data Engineer',
('ingeniero', 'datos'): 'Data Engineer',
('data', 'architect'): 'Data Architect',
('arquitecto', 'datos'): 'Data Architect',
('business', 'intelligence'): 'BI Analyst',
('bi', 'analyst'): 'BI Analyst',
('analista', 'bi'): 'BI Analyst',
('inteligencia', 'negocios'): 'BI Analyst',
('business', 'analyst'): 'Business Analyst',
('analista', 'negocio'): 'Business Analyst',
('machine', 'learning'): 'ML Engineer',
('aprendizaje' , 'automatico'): 'ML Engineer'
}
# Proof of concept
list(job_dict.keys())[0][0] in 'data analyst' and list(job_dict.keys())[0][1] in 'data analyst'
True
# Conversion of job observations to lower case
df['Job'] = df['Job'].str.casefold()
# Cleaning of the variable Job
for terms, title in job_dict.items():
    df['Job'] = df['Job'].map(lambda x: title if all(term in x for term in terms) else x)
df.head(10)
| Original Job Title | Job | Salary | Company | Location | |
|---|---|---|---|---|---|
| 0 | Analista de datos | Data Analyst | $20,000 - $25,000 Mensual | SKANDIA OPERADORA DE F... | Polanco II Sección, Miguel Hidalgo, CDMX |
| 1 | Analista de Datos | Data Analyst | Sueldo no mostrado por la empresa | GENERAL DE SEGUROS | CDMX |
| 2 | Analista de datos MDM | Data Analyst | $17,000 - $20,000 Mensual | Grupo Alnus S.A. de C.V. | Monterrey, N.L. |
| 3 | ANALISTA DE DATOS ECOMMERCE | Data Analyst | $16,000 Mensual | Grupo Daisa | Cuauhtémoc, CDMX |
| 4 | Analista de datos maestros | Data Analyst | $10,000 Mensual | Empresa confidencial | Monterrey, N.L. |
| 5 | Coordinador analista de datos | Data Analyst | Sueldo no mostrado por la empresa | Empresa confidencial | Miguel Hidalgo, CDMX |
| 6 | ANALISTA DE DATOS / EXPERTO EN EXCEL | Data Analyst | $11,000 - $13,000 Mensual | Empresa confidencial | Zapopan, Jal. |
| 7 | Analista de Datos en Azure (6 meses) | Data Analyst | $55,000 - $60,000 Mensual | SYGNO | Monterrey, N.L. |
| 8 | Analista de Datos (Power Bi-Tableau) | Data Analyst | $29,000 Mensual | Empresa confidencial | CDMX |
| 9 | Analista de base de datos | Data Analyst | $25,000 - $27,000 Mensual | SERVICIOS DE ENERGIA E... | Monterrey, N.L. |
# Visual exploration of the Job variable (Top 20 positions)
number_positions = 20
fig, ax = plt.subplots(figsize = (10, 7))
sns.barplot(x=df['Job'].value_counts()[:number_positions],
y=df['Job'].value_counts().keys()[:number_positions],
hue=df['Job'].value_counts().keys()[:number_positions],
legend=False,
palette="Blues_r",
alpha= 0.9)
plt.title('Top 20 Job Titles Frequency')
plt.xlabel('Frequency')
plt.ylabel('Job Title')
plt.show()
In the plot above, it can be observed that, after the harmonization, the standard data job titles rank among the most frequent ones, while the remaining titles correspond to positions outside the scope of this study.
# Drop of observations not corresponding to data jobs
data_jobs = ['Data Analyst', 'Business Analyst', 'Data Engineer',
             'Data Scientist', 'Data Architect', 'BI Analyst', 'ML Engineer']
df = df.loc[df['Job'].isin(data_jobs)]
df['Job'].value_counts()
Job
Data Analyst        186
Business Analyst     93
Data Engineer        92
Data Scientist       88
BI Analyst           70
Data Architect       29
ML Engineer           5
Name: count, dtype: int64
# Resulting number of observations
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 563 entries, 0 to 1181
Data columns (total 5 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   Original Job Title  563 non-null    object
 1   Job                 563 non-null    object
 2   Salary              563 non-null    object
 3   Company             563 non-null    object
 4   Location            518 non-null    object
dtypes: object(5)
memory usage: 26.4+ KB
# Visual exploration of the cleaned Job variable
fig, ax = plt.subplots(figsize = (8, 5))
sns.barplot(x=df['Job'].value_counts()[:7],
y=df['Job'].value_counts().keys()[:7],
hue=df['Job'].value_counts().keys()[:7],
legend=False,
palette="Blues_r",
alpha=0.7)
plt.title('Frequency per Job Title')
plt.xlabel('Frequency')
plt.ylabel('Job Title')
plt.show()
From the plot above, it can be seen that the variable Job has been successfully cleaned: data-related jobs are the only observations remaining in the dataset, and their job titles are consistent. It is also noteworthy that Data Analyst is the most demanded position.
This section shows the process carried out to impute the null values with the string "Confidential" and to make the letter case of the company names consistent.
Moreover, whenever the company was published in the vacancy as "Empresa Confidencial" (Spanish for confidential company), it was also changed to "Confidential".
# Cleaning of the Company variable
df['Company'] = df['Company'].apply(lambda x: "Confidential" if pd.isnull(x) else x)
df['Company'] = df['Company'].apply(lambda x: str.title(str(x)))
df['Company'] = df['Company'].apply(lambda x: "Confidential" if x == "Empresa Confidencial" else x)
df.head(10)
| Original Job Title | Job | Salary | Company | Location | |
|---|---|---|---|---|---|
| 0 | Analista de datos | Data Analyst | $20,000 - $25,000 Mensual | Skandia Operadora De F... | Polanco II Sección, Miguel Hidalgo, CDMX |
| 1 | Analista de Datos | Data Analyst | Sueldo no mostrado por la empresa | General De Seguros | CDMX |
| 2 | Analista de datos MDM | Data Analyst | $17,000 - $20,000 Mensual | Grupo Alnus S.A. De C.V. | Monterrey, N.L. |
| 3 | ANALISTA DE DATOS ECOMMERCE | Data Analyst | $16,000 Mensual | Grupo Daisa | Cuauhtémoc, CDMX |
| 4 | Analista de datos maestros | Data Analyst | $10,000 Mensual | Confidential | Monterrey, N.L. |
| 5 | Coordinador analista de datos | Data Analyst | Sueldo no mostrado por la empresa | Confidential | Miguel Hidalgo, CDMX |
| 6 | ANALISTA DE DATOS / EXPERTO EN EXCEL | Data Analyst | $11,000 - $13,000 Mensual | Confidential | Zapopan, Jal. |
| 7 | Analista de Datos en Azure (6 meses) | Data Analyst | $55,000 - $60,000 Mensual | Sygno | Monterrey, N.L. |
| 8 | Analista de Datos (Power Bi-Tableau) | Data Analyst | $29,000 Mensual | Confidential | CDMX |
| 9 | Analista de base de datos | Data Analyst | $25,000 - $27,000 Mensual | Servicios De Energia E... | Monterrey, N.L. |
# Visual exploration of the cleaned Company variable
fig, ax = plt.subplots(figsize = (8, 6))
company_counts = df['Company'].value_counts().head(15)
sns.barplot(x=company_counts.values,
            y=company_counts.index,
            hue=company_counts.index,
            legend=False,
            palette="Blues_r",
            alpha=0.8)
plt.title('Top 15 Companies Offering Data Jobs')
plt.xlabel('Frequency')
plt.ylabel('Company')
plt.show()
The plot above suggests that the letter case of the different company names is now consistent, and all the vacancies with confidential companies have been harmonized as Confidential. So, it is deemed that the variable Company has been successfully cleaned.
In this sense, it is noteworthy that most of the vacancies for data jobs correspond to confidential companies. However, among the publications where the hiring company is disclosed, the top company demanding data jobs in Mexico is Bairesdev.
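As a programmatic complement to the visual check, one could also assert that no case-variant duplicates survive the title-casing step. A minimal sketch, using hypothetical company names rather than the notebook's actual data:

```python
import pandas as pd

# Hypothetical stand-in for the cleaned Company column
companies = pd.Series(['Bairesdev', 'Confidential', 'Softtek', 'Confidential'])

# After title-casing, names differing only by letter case should have
# collapsed to one spelling, so unique names == unique casefolded names
assert companies.nunique() == companies.str.casefold().nunique()
```

If the assertion fails, some company name still appears under more than one capitalization.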
This section shows the process carried out to harmonize the name of the Mexican States in which the data-related positions are demanded.
To do so, firstly, the state component was extracted from each location observation.
# Extraction of the state from the location data
df['Location'] = df['Location'].apply(lambda x: "Remote/NA" if pd.isnull(x) else x)
df['Location'] = df['Location'].apply(lambda x: str(x).split(",")[-1] if len(str(x).split(",")) > 1 else x)
df['Location'] = df['Location'].apply(lambda x: str(x).replace('\u200b', ' ').strip())  # remove zero-width spaces
df.head(10)
| Original Job Title | Job | Salary | Company | Location | |
|---|---|---|---|---|---|
| 0 | Analista de datos | Data Analyst | $20,000 - $25,000 Mensual | Skandia Operadora De F... | CDMX |
| 1 | Analista de Datos | Data Analyst | Sueldo no mostrado por la empresa | General De Seguros | CDMX |
| 2 | Analista de datos MDM | Data Analyst | $17,000 - $20,000 Mensual | Grupo Alnus S.A. De C.V. | N.L. |
| 3 | ANALISTA DE DATOS ECOMMERCE | Data Analyst | $16,000 Mensual | Grupo Daisa | CDMX |
| 4 | Analista de datos maestros | Data Analyst | $10,000 Mensual | Confidential | N.L. |
| 5 | Coordinador analista de datos | Data Analyst | Sueldo no mostrado por la empresa | Confidential | CDMX |
| 6 | ANALISTA DE DATOS / EXPERTO EN EXCEL | Data Analyst | $11,000 - $13,000 Mensual | Confidential | Jal. |
| 7 | Analista de Datos en Azure (6 meses) | Data Analyst | $55,000 - $60,000 Mensual | Sygno | N.L. |
| 8 | Analista de Datos (Power Bi-Tableau) | Data Analyst | $29,000 Mensual | Confidential | CDMX |
| 9 | Analista de base de datos | Data Analyst | $25,000 - $27,000 Mensual | Servicios De Energia E... | N.L. |
Then, a dictionary was defined with the abbreviations as keys and the full state names as values. After that, the dictionary was applied to the data in order to replace the abbreviations with the full state names.
# Dictionary with the retrieved Location values
location_dict = {
'CDMX': 'Ciudad de México',
'Chih.': 'Chihuahua',
'Edo. Méx.': 'Estado de México',
'Gto.': 'Guanajuato',
'Jal.': 'Jalisco',
'Q. Roo': 'Quintana Roo',
'N. L.': 'Nuevo León',
'N.L.': 'Nuevo León',
'Pue.': 'Puebla',
'México': 'Estado de México',
'Zac.': 'Zacatecas',
'Tamps.': 'Tamaulipas',
'Mor.': 'Morelos',
'Sin.': 'Sinaloa',
'Oax.': 'Oaxaca',
'Qro.': 'Querétaro',
'Mich.': 'Michoacán',
'Son.': 'Sonora',
'BC.': 'Baja California',
'SLP.': 'San Luis Potosí',
'Yuc.': 'Yucatán',
'Coah.': 'Coahuila',
'BCS.': 'Baja California Sur',
'Nay.': 'Nayarit',
'Ags.': 'Aguascalientes',
'Hgo.': 'Hidalgo',
'Chis.': 'Chiapas',
'Ver.': 'Veracruz',
'Tab.': 'Tabasco'
}
# Cleaning of the Location variable in the dataset
df['Location'] = df['Location'].replace(location_dict)  # exact-match mapping; unmapped values pass through
df.head(10)
| Original Job Title | Job | Salary | Company | Location | |
|---|---|---|---|---|---|
| 0 | Analista de datos | Data Analyst | $20,000 - $25,000 Mensual | Skandia Operadora De F... | Ciudad de México |
| 1 | Analista de Datos | Data Analyst | Sueldo no mostrado por la empresa | General De Seguros | Ciudad de México |
| 2 | Analista de datos MDM | Data Analyst | $17,000 - $20,000 Mensual | Grupo Alnus S.A. De C.V. | Nuevo León |
| 3 | ANALISTA DE DATOS ECOMMERCE | Data Analyst | $16,000 Mensual | Grupo Daisa | Ciudad de México |
| 4 | Analista de datos maestros | Data Analyst | $10,000 Mensual | Confidential | Nuevo León |
| 5 | Coordinador analista de datos | Data Analyst | Sueldo no mostrado por la empresa | Confidential | Ciudad de México |
| 6 | ANALISTA DE DATOS / EXPERTO EN EXCEL | Data Analyst | $11,000 - $13,000 Mensual | Confidential | Jalisco |
| 7 | Analista de Datos en Azure (6 meses) | Data Analyst | $55,000 - $60,000 Mensual | Sygno | Nuevo León |
| 8 | Analista de Datos (Power Bi-Tableau) | Data Analyst | $29,000 Mensual | Confidential | Ciudad de México |
| 9 | Analista de base de datos | Data Analyst | $25,000 - $27,000 Mensual | Servicios De Energia E... | Nuevo León |
# Visual exploration of the cleaned Location variable
fig, ax = plt.subplots(figsize = (5, 8))
sns.barplot(x=df['Location'].value_counts().values,
y=df['Location'].value_counts().index,
hue=df['Location'].value_counts().index,
legend=False,
palette="Spectral_r",
alpha = 0.9)
plt.title('Frequency per Mexican State')
plt.xlabel('Frequency')
plt.ylabel('State')
plt.show()
The plot above shows that the names of the Mexican states where data job vacancies are posted are now consistent. So, the variable Location has been successfully cleaned.
Moreover, it is noteworthy that Ciudad de México (Mexico City) is the location where most of the data jobs are posted.
This section shows the process carried out to wrangle and clean the salary data. To do so, useless characters were removed from the salary observations; then, the salary ranges were split into two columns and an average salary was calculated where applicable. Finally, the Salary columns that were no longer useful were dropped.
# Checking that all disclosed Salary observations are stated on a monthly basis
df[~df['Salary'].str.contains('Mensual', na=True)  # na=True so missing salaries are not flagged
   & (df['Salary'] != 'Sueldo no mostrado por la empresa')]
| Original Job Title | Job | Salary | Company | Location |
|---|
Thus, all of the observations correspond to monthly salaries.
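Had any salaries been published on an annual basis instead, a normalization step would have been needed before comparing them. A hedged sketch of such a guard, on hypothetical values rather than the notebook's data:

```python
import numpy as np
import pandas as pd

# Hypothetical mix of annual and monthly postings
s = pd.Series(['$240,000 Anual', '$20,000 Mensual'])

# Strip currency formatting, pull out the number, and divide annual figures by 12
amounts = s.str.replace('[$,]', '', regex=True).str.extract(r'(\d+)')[0].astype(float)
monthly = np.where(s.str.contains('Anual'), amounts / 12, amounts)
```

Since no 'Anual' observations were found, this step was unnecessary here.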
# Replacing the wording of the positions without disclosed salaries with NaN values
df['Salary'] = df['Salary'].apply(lambda x: np.nan if x == 'Sueldo no mostrado por la empresa' else x)
df.head()
| Original Job Title | Job | Salary | Company | Location | |
|---|---|---|---|---|---|
| 0 | Analista de datos | Data Analyst | $20,000 - $25,000 Mensual | Skandia Operadora De F... | Ciudad de México |
| 1 | Analista de Datos | Data Analyst | NaN | General De Seguros | Ciudad de México |
| 2 | Analista de datos MDM | Data Analyst | $17,000 - $20,000 Mensual | Grupo Alnus S.A. De C.V. | Nuevo León |
| 3 | ANALISTA DE DATOS ECOMMERCE | Data Analyst | $16,000 Mensual | Grupo Daisa | Ciudad de México |
| 4 | Analista de datos maestros | Data Analyst | $10,000 Mensual | Confidential | Nuevo León |
# Removing useless characters from column values
char_remove = {
'Anual': '',
'Mensual': '',
'$': '',
',': ''}
for key, value in char_remove.items():
    df['Salary'] = df['Salary'].str.replace(key, value, regex=False).str.strip()  # regex=False: treat '$' literally
df.head()
| Original Job Title | Job | Salary | Company | Location | |
|---|---|---|---|---|---|
| 0 | Analista de datos | Data Analyst | 20000 - 25000 | Skandia Operadora De F... | Ciudad de México |
| 1 | Analista de Datos | Data Analyst | NaN | General De Seguros | Ciudad de México |
| 2 | Analista de datos MDM | Data Analyst | 17000 - 20000 | Grupo Alnus S.A. De C.V. | Nuevo León |
| 3 | ANALISTA DE DATOS ECOMMERCE | Data Analyst | 16000 | Grupo Daisa | Ciudad de México |
| 4 | Analista de datos maestros | Data Analyst | 10000 | Confidential | Nuevo León |
# Splitting Salary column values into two columns
df[['Min Salary', 'Max Salary']] = df['Salary'].str.split('-', expand=True)
df['Min Salary'] = df['Min Salary'].str.strip().astype(np.float64)
df['Max Salary'] = df['Max Salary'].str.strip().astype(np.float64)
df.head()
| Original Job Title | Job | Salary | Company | Location | Min Salary | Max Salary | |
|---|---|---|---|---|---|---|---|
| 0 | Analista de datos | Data Analyst | 20000 - 25000 | Skandia Operadora De F... | Ciudad de México | 20000.0 | 25000.0 |
| 1 | Analista de Datos | Data Analyst | NaN | General De Seguros | Ciudad de México | NaN | NaN |
| 2 | Analista de datos MDM | Data Analyst | 17000 - 20000 | Grupo Alnus S.A. De C.V. | Nuevo León | 17000.0 | 20000.0 |
| 3 | ANALISTA DE DATOS ECOMMERCE | Data Analyst | 16000 | Grupo Daisa | Ciudad de México | 16000.0 | NaN |
| 4 | Analista de datos maestros | Data Analyst | 10000 | Confidential | Nuevo León | 10000.0 | NaN |
# Calculating average salary for each vacancy
df['Avg Salary'] = (df['Min Salary'] + df['Max Salary']) / 2
df.head()
| Original Job Title | Job | Salary | Company | Location | Min Salary | Max Salary | Avg Salary | |
|---|---|---|---|---|---|---|---|---|
| 0 | Analista de datos | Data Analyst | 20000 - 25000 | Skandia Operadora De F... | Ciudad de México | 20000.0 | 25000.0 | 22500.0 |
| 1 | Analista de Datos | Data Analyst | NaN | General De Seguros | Ciudad de México | NaN | NaN | NaN |
| 2 | Analista de datos MDM | Data Analyst | 17000 - 20000 | Grupo Alnus S.A. De C.V. | Nuevo León | 17000.0 | 20000.0 | 18500.0 |
| 3 | ANALISTA DE DATOS ECOMMERCE | Data Analyst | 16000 | Grupo Daisa | Ciudad de México | 16000.0 | NaN | NaN |
| 4 | Analista de datos maestros | Data Analyst | 10000 | Confidential | Nuevo León | 10000.0 | NaN | NaN |
It seems that some salary data appears only in the Min Salary column. Those observations correspond to vacancies in which the published salary was a single value rather than a range. Thus, that value was copied into the Max Salary and Avg Salary columns.
# Imputing NaN values in the Avg Salary column with the values in the Min Salary column
mask = pd.isna(df['Avg Salary']) & ~pd.isna(df['Min Salary'])
df.loc[mask, 'Avg Salary'] = df['Min Salary'] # Using loc to avoid the SettingWithCopyWarning
df.loc[mask, 'Max Salary'] = df['Min Salary'] # Using loc to avoid the SettingWithCopyWarning
df.head()
| Original Job Title | Job | Salary | Company | Location | Min Salary | Max Salary | Avg Salary | |
|---|---|---|---|---|---|---|---|---|
| 0 | Analista de datos | Data Analyst | 20000 - 25000 | Skandia Operadora De F... | Ciudad de México | 20000.0 | 25000.0 | 22500.0 |
| 1 | Analista de Datos | Data Analyst | NaN | General De Seguros | Ciudad de México | NaN | NaN | NaN |
| 2 | Analista de datos MDM | Data Analyst | 17000 - 20000 | Grupo Alnus S.A. De C.V. | Nuevo León | 17000.0 | 20000.0 | 18500.0 |
| 3 | ANALISTA DE DATOS ECOMMERCE | Data Analyst | 16000 | Grupo Daisa | Ciudad de México | 16000.0 | 16000.0 | 16000.0 |
| 4 | Analista de datos maestros | Data Analyst | 10000 | Confidential | Nuevo León | 10000.0 | 10000.0 | 10000.0 |
# Visual exploration of the cleaned Salary variable
fig, ax = plt.subplots(figsize = (10, 7))
sns.histplot(data=df['Avg Salary'],
color=sns.color_palette('Blues_r')[0],
alpha=0.6)
plt.title('Salary Distribution')
plt.xlabel('Salary')
plt.ylabel('Frequency')
plt.show()
The histogram above suggests that all Salary observations have been transformed from string-formatted ranges into actual numbers. So, the variable Salary is considered to have been successfully cleaned. On the other hand, it is noteworthy that some observations fall well beyond the main salary distribution. Let's take a look at those observations.
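The $70,000 MXN cutoff used below is a judgment call; a data-driven alternative is Tukey's 1.5 × IQR fence. A minimal sketch on a hypothetical salary sample (not the notebook's actual data):

```python
import pandas as pd

# Hypothetical monthly salaries (MXN) standing in for df['Avg Salary'].dropna()
salaries = pd.Series([10000, 12000, 16000, 20000, 22500, 25000, 27500, 29000, 90000])

q1, q3 = salaries.quantile([0.25, 0.75])
upper_fence = q3 + 1.5 * (q3 - q1)  # Tukey's rule for flagging high outliers
outliers = salaries[salaries > upper_fence]
```

Here the fence lands at 44,750, flagging only the 90,000 observation; on the real data the fence would follow the actual quartiles.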
# Exploring positions whose salary is more than $70,000 MXN per month
df[df['Avg Salary'] > 70000]
| Original Job Title | Job | Salary | Company | Location | Min Salary | Max Salary | Avg Salary | |
|---|---|---|---|---|---|---|---|---|
| 48 | Analista de datos sr. | Data Analyst | 70000 - 80000 | Confidential | Ciudad de México | 70000.0 | 80000.0 | 75000.0 |
| 52 | Analista de datos Sr. | Data Analyst | 80000 | Confidential | Ciudad de México | 80000.0 | 80000.0 | 80000.0 |
| 135 | SQL Server Data Analyst | Data Analyst | 70000 - 85000 | Caspex Corp | Remote/NA | 70000.0 | 85000.0 | 77500.0 |
| 199 | Científico de Datos | Data Scientist | 90000 | Ecosistemex S. De R.L.... | Ciudad de México | 90000.0 | 90000.0 | 90000.0 |
| 225 | Sr Data Scientist | Data Scientist | 70000 - 80000 | Enterprise Solutions, ... | Ciudad de México | 70000.0 | 80000.0 | 75000.0 |
| 281 | Sr Data Engineer | Data Engineer | 75000 - 80000 | Addon Technologies Inc. | Jalisco | 75000.0 | 80000.0 | 77500.0 |
| 427 | Data engineer (Databricks) | Data Engineer | 80000 - 90000 | Manpower, S.A. De C.V. | Estado de México | 80000.0 | 90000.0 | 85000.0 |
| 428 | Data Engineer (Databricks) | Data Engineer | 80000 - 90000 | Manpower, S.A. De C.V. | Estado de México | 80000.0 | 90000.0 | 85000.0 |
| 448 | Data Engineer with Fivetran developer | Data Engineer | 70000 - 80000 | Enterprise Solutions, ... | Remote/NA | 70000.0 | 80000.0 | 75000.0 |
| 479 | Data Architect | Data Architect | 70000 - 80000 | Softtek | Remote/NA | 70000.0 | 80000.0 | 75000.0 |
From the above results, it is plausible that Data Engineer and Data Architect positions command such high salaries, as these roles require advanced programming skills and expertise. However, it is interesting that some Sr. Data Analyst and Sr. Data Scientist positions also made it to the top.
Finally, the original Salary column was dropped.
# Dropping of the original Salary column
df = df.drop(columns = ['Salary'])
df.head()
| Original Job Title | Job | Company | Location | Min Salary | Max Salary | Avg Salary | |
|---|---|---|---|---|---|---|---|
| 0 | Analista de datos | Data Analyst | Skandia Operadora De F... | Ciudad de México | 20000.0 | 25000.0 | 22500.0 |
| 1 | Analista de Datos | Data Analyst | General De Seguros | Ciudad de México | NaN | NaN | NaN |
| 2 | Analista de datos MDM | Data Analyst | Grupo Alnus S.A. De C.V. | Nuevo León | 17000.0 | 20000.0 | 18500.0 |
| 3 | ANALISTA DE DATOS ECOMMERCE | Data Analyst | Grupo Daisa | Ciudad de México | 16000.0 | 16000.0 | 16000.0 |
| 4 | Analista de datos maestros | Data Analyst | Confidential | Nuevo León | 10000.0 | 10000.0 | 10000.0 |
df.describe()
| Min Salary | Max Salary | Avg Salary | |
|---|---|---|---|
| count | 197.000000 | 197.000000 | 197.000000 |
| mean | 30114.121827 | 34211.827411 | 32162.974619 |
| std | 18295.173638 | 20662.664676 | 19417.175845 |
| min | 4000.000000 | 4000.000000 | 4000.000000 |
| 25% | 16000.000000 | 19000.000000 | 17500.000000 |
| 50% | 25000.000000 | 30000.000000 | 27500.000000 |
| 75% | 40000.000000 | 45000.000000 | 42000.000000 |
| max | 90000.000000 | 90000.000000 | 90000.000000 |
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 563 entries, 0 to 1181
Data columns (total 7 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   Original Job Title  563 non-null    object
 1   Job                 563 non-null    object
 2   Company             563 non-null    object
 3   Location            563 non-null    object
 4   Min Salary          197 non-null    float64
 5   Max Salary          197 non-null    float64
 6   Avg Salary          197 non-null    float64
dtypes: float64(3), object(4)
memory usage: 51.4+ KB
So, the dataset has been successfully cleaned and prepared for the Data Analysis & Visualization.
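A detail worth quantifying: only 197 of the 563 vacancies disclose a salary, roughly 35%. Such a disclosure rate can be computed in one line; a sketch on a stand-in column (the values below are illustrative, not the notebook's data):

```python
import numpy as np
import pandas as pd

# Stand-in for df['Avg Salary']; NaN marks an undisclosed salary
avg_salary = pd.Series([22500.0, np.nan, 18500.0, 16000.0, np.nan])

disclosure_rate = avg_salary.notna().mean()  # fraction of non-null rows
```

This works because `notna()` yields booleans and `mean()` treats them as 0/1.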
The processed dataset was exported to CSV.
# Exporting the processed dataset
df.to_csv('Dataset_processed.csv', index=False, encoding='utf-8')
The second dataframe, containing only those vacancies whose salaries have been disclosed, was built in the present section.
# Second dataframe with only disclosed salary data
salary_df = df.dropna(axis = 0, how='any', subset = ['Avg Salary'])
salary_df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 197 entries, 0 to 1067
Data columns (total 7 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   Original Job Title  197 non-null    object
 1   Job                 197 non-null    object
 2   Company             197 non-null    object
 3   Location            197 non-null    object
 4   Min Salary          197 non-null    float64
 5   Max Salary          197 non-null    float64
 6   Avg Salary          197 non-null    float64
dtypes: float64(3), object(4)
memory usage: 12.3+ KB
salary_df.describe()
| Min Salary | Max Salary | Avg Salary | |
|---|---|---|---|
| count | 197.000000 | 197.000000 | 197.000000 |
| mean | 30114.121827 | 34211.827411 | 32162.974619 |
| std | 18295.173638 | 20662.664676 | 19417.175845 |
| min | 4000.000000 | 4000.000000 | 4000.000000 |
| 25% | 16000.000000 | 19000.000000 | 17500.000000 |
| 50% | 25000.000000 | 30000.000000 | 27500.000000 |
| 75% | 40000.000000 | 45000.000000 | 42000.000000 |
| max | 90000.000000 | 90000.000000 | 90000.000000 |
salary_df.head(10)
| Original Job Title | Job | Company | Location | Min Salary | Max Salary | Avg Salary | |
|---|---|---|---|---|---|---|---|
| 0 | Analista de datos | Data Analyst | Skandia Operadora De F... | Ciudad de México | 20000.0 | 25000.0 | 22500.0 |
| 2 | Analista de datos MDM | Data Analyst | Grupo Alnus S.A. De C.V. | Nuevo León | 17000.0 | 20000.0 | 18500.0 |
| 3 | ANALISTA DE DATOS ECOMMERCE | Data Analyst | Grupo Daisa | Ciudad de México | 16000.0 | 16000.0 | 16000.0 |
| 4 | Analista de datos maestros | Data Analyst | Confidential | Nuevo León | 10000.0 | 10000.0 | 10000.0 |
| 6 | ANALISTA DE DATOS / EXPERTO EN EXCEL | Data Analyst | Confidential | Jalisco | 11000.0 | 13000.0 | 12000.0 |
| 7 | Analista de Datos en Azure (6 meses) | Data Analyst | Sygno | Nuevo León | 55000.0 | 60000.0 | 57500.0 |
| 8 | Analista de Datos (Power Bi-Tableau) | Data Analyst | Confidential | Ciudad de México | 29000.0 | 29000.0 | 29000.0 |
| 9 | Analista de base de datos | Data Analyst | Servicios De Energia E... | Nuevo León | 25000.0 | 27000.0 | 26000.0 |
| 10 | Analista de Bases de Datos BI | Data Analyst | Confidential | Ciudad de México | 28000.0 | 29000.0 | 28500.0 |
| 11 | Analista de Gestión de Datos Jr. | Data Analyst | Pepsico División Foods | Sonora | 11000.0 | 11500.0 | 11250.0 |
After the dataset was cleaned and prepared, the data was analyzed and visualized in order to answer the following questions:
To answer these questions, different plots and charts were drawn using several libraries, such as Matplotlib, Seaborn, Folium, and Plotly, with the purpose of testing them and selecting the most pleasant and aesthetic results for reporting.
On the other hand, a specific directory named "Figures" was used to store the relevant plots created during the analysis.
# Create the Figures directory if it does not already exist
os.makedirs("Figures", exist_ok=True)
To answer this question, a donut chart was plotted using the series returned by the value_counts method.
The idea was to show the different data job categories as the arcs of the donut chart, where the length of each arc represents that category's demand in the Mexican job market.
The donut charts were drawn using both Matplotlib and Plotly.
# Color palette for pie charts
pie_colors = ['#154360','#539ecd','#89bedc',"#a9cce3", "#d4e6f1",'#dbe9f6', "#ebf5fb"]
# Donut chart of the data jobs demand with Matplotlib
fig, ax = plt.subplots(figsize = (8, 13))
explode = [0.02] * 7
wedges, texts, autotexts = ax.pie(x = list(df['Job'].value_counts().values),
wedgeprops=dict(edgecolor='w', linewidth= 1),
textprops=dict(size=17, weight="bold", color = 'dimgray'),
colors=pie_colors,
autopct='%.0f%%',
pctdistance=1.15,
startangle = 90,
counterclock = False,
explode = explode)
center = plt.Circle( (0,0), 0.7, color='white')
p = plt.gcf()
p.gca().add_artist(center)
plt.legend(wedges,
list(df['Job'].value_counts().keys()),
fontsize=13,
#loc="center right",
bbox_to_anchor=(1, 0, 0.5, 0.75))
plt.tight_layout()
plt.savefig('Figures/Fig1_DemandOfDataJobsPerCategory.png', bbox_inches = 'tight')
plt.show()
# Donut chart of the data jobs demand with Plotly
job_df = df['Job'].value_counts().reset_index().rename(columns={'count': 'Count'})
fig = px.pie(job_df,
values='Count',
names='Job',
color='Job',
hole=0.7,
color_discrete_sequence=px.colors.sequential.Blues_r,
height=500,
width=700,
title='<b>Demand of Data Jobs Per Category</b>')
fig.update_layout(title_x=0.5,
font=font_px,
legend = legend_px
)
fig.update_traces(hoverinfo='label+percent+name',
textfont_size=18,
textinfo='percent',
texttemplate='%{percent:.0%}',
direction ='clockwise',
marker=dict(colors=pie_colors, line=dict(color='white', width=1)))
fig.write_image("Figures/Fig1_DemandOfDataJobsPerCategory2.png", scale=2)
fig.write_image("Figures/Fig1_DemandOfDataJobsPerCategory2.svg", scale=2)
fig.show(config=config)
From the above plot, it is possible to conclude that a third of the data jobs (33%) correspond to Data Analyst positions, rendering them the most demanded in the Mexican labor market at the time of this study. In second place are the Business Analyst positions, with about 17% of the total.
Furthermore, it is noteworthy that Data Engineer and Data Scientist positions are more demanded than Business Intelligence Analyst positions, with about 16%, 16%, and 12% of the demand in the country, respectively.
On the contrary, Data Architect and Machine Learning Engineer positions are the least demanded, with only about 5% and 1% of the total, respectively.
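The percentage figures above come from normalizing the job counts. A minimal sketch of that computation on a stand-in series (the counts here are illustrative, not the notebook's exact tallies):

```python
import pandas as pd

# Illustrative stand-in for df['Job']
jobs = pd.Series(['Data Analyst'] * 33 + ['Business Analyst'] * 17 +
                 ['Data Engineer'] * 16 + ['Data Scientist'] * 16 +
                 ['BI Analyst'] * 12 + ['Data Architect'] * 5 +
                 ['ML Engineer'] * 1)

shares = jobs.value_counts(normalize=True) * 100  # percentage per category
```

`value_counts(normalize=True)` divides each count by the series length, so the shares always sum to 100%.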
To answer this question, a choropleth was prepared using Folium and Plotly, in which the intensity of the color was proportional to the demand of data jobs per state.
First, to define the specific location of the markers in the Folium map, a dictionary was defined with the corresponding ID, Latitude and Longitude of the capital cities for each Mexican State.
The IDs correspond to the IDs of the file with the geographical data (Arroyo-Velázquez, 2022).
Then, the dictionary was converted into a Pandas dataframe.
# Mexican states dictionary with corresponding ID, Latitude and Longitude
states_dict = {'Aguascalientes': ('AS', 21.87945992, -102.2904135),
'Baja California': ('BC', 32.663214,-115.4903741),
'Baja California Sur': ('BS', 24.1584937,-110.315928),
'Campeche': ('CC', 19.8450352,-90.5381231),
'Chiapas': ('CS', 16.7541485,-93.119001),
'Chihuahua': ('CH', 28.6349557,-106.0777049),
'Coahuila': ('CL', 25.4286965,-100.9994484),
'Colima': ('CM', 19.2408324,-103.7291389),
'Ciudad de México': ('DF', 19.4335493,-99.1344048),
'Durango': ('DG', 24.0241017,-104.6708325),
'Guanajuato': ('GT', 21.0176446,-101.2586863),
'Guerrero': ('GR', 17.5516921,-99.5025877),
'Hidalgo': ('HG', 20.1183855,-98.7540094),
'Jalisco': ('JC', 20.6773775,-103.3494204),
'Estado de México': ('MC', 19.289191,-99.6670425),
'Michoacán': ('MN', 19.7030535,-101.1937953),
'Morelos': ('MS', 18.9218499,-99.2353856),
'Nayarit': ('NT', 21.5122308,-104.8948845),
'Nuevo León': ('NL', 25.6717637,-100.3163831),
'Oaxaca': ('OC', 17.0617935,-96.7271634),
'Puebla': ('PL', 19.0428817,-98.2002919),
'Querétaro': ('QT', 20.37998212, -100.0000308),
'Quintana Roo': ('QR', 18.4978052,-88.3029951),
'San Luis Potosí': ('SP', 22.1521646,-100.9765552),
'Sinaloa': ('SL', 24.8082702,-107.3945828),
'Sonora': ('SR', 29.0748734,-110.9597578),
'Tabasco': ('TC', 17.9882632,-92.9209807),
'Tamaulipas': ('TS', 23.7312703,-99.1517694),
'Tlaxcala': ('TL', 19.3171271,-98.2386354),
'Veracruz': ('VZ', 19.5269375,-96.92401),
'Yucatán': ('YN', 20.9664386,-89.623114),
'Zacatecas': ('ZS', 22.7753476,-102.5740002)}
states_df = (pd.DataFrame.from_dict(states_dict, orient='index').reset_index()
             .rename(columns={"index": "State", 0: "ID", 1: "Lat", 2: "Long"})
             .set_index('State'))
states_df.head()
| ID | Lat | Long | |
|---|---|---|---|
| State | |||
| Aguascalientes | AS | 21.879460 | -102.290413 |
| Baja California | BC | 32.663214 | -115.490374 |
| Baja California Sur | BS | 24.158494 | -110.315928 |
| Campeche | CC | 19.845035 | -90.538123 |
| Chiapas | CS | 16.754148 | -93.119001 |
After that, with the purpose of assigning a color to each Mexican state in the choropleth, the percentage of data jobs per state was calculated using the value_counts method. To also cover the states without data jobs, an outer join with the previously defined states_df dataframe was performed and the resulting NaN values were filled with zeros.
# Calculating the Demand Percentage per State
demand_by_state_df = pd.DataFrame(df['Location'].value_counts())
total = demand_by_state_df['count'].sum()
demand_by_state_df['Percentage'] = demand_by_state_df['count'] / total * 100
demand_by_state_df = demand_by_state_df.reset_index().rename(columns={"Location": "State", "count": "Count"})
demand_by_state_df = states_df.merge(demand_by_state_df, left_on='State', right_on='State', how = 'outer').fillna(0)
demand_by_state_df
| State | ID | Lat | Long | Count | Percentage | |
|---|---|---|---|---|---|---|
| 0 | Aguascalientes | AS | 21.879460 | -102.290413 | 4.0 | 0.710480 |
| 1 | Baja California | BC | 32.663214 | -115.490374 | 4.0 | 0.710480 |
| 2 | Baja California Sur | BS | 24.158494 | -110.315928 | 1.0 | 0.177620 |
| 3 | Campeche | CC | 19.845035 | -90.538123 | 0.0 | 0.000000 |
| 4 | Chiapas | CS | 16.754148 | -93.119001 | 0.0 | 0.000000 |
| 5 | Chihuahua | CH | 28.634956 | -106.077705 | 11.0 | 1.953819 |
| 6 | Coahuila | CL | 25.428697 | -100.999448 | 0.0 | 0.000000 |
| 7 | Colima | CM | 19.240832 | -103.729139 | 0.0 | 0.000000 |
| 8 | Ciudad de México | DF | 19.433549 | -99.134405 | 320.0 | 56.838366 |
| 9 | Durango | DG | 24.024102 | -104.670833 | 0.0 | 0.000000 |
| 10 | Guanajuato | GT | 21.017645 | -101.258686 | 5.0 | 0.888099 |
| 11 | Guerrero | GR | 17.551692 | -99.502588 | 0.0 | 0.000000 |
| 12 | Hidalgo | HG | 20.118385 | -98.754009 | 0.0 | 0.000000 |
| 13 | Jalisco | JC | 20.677377 | -103.349420 | 51.0 | 9.058615 |
| 14 | Estado de México | MC | 19.289191 | -99.667042 | 22.0 | 3.907638 |
| 15 | Michoacán | MN | 19.703053 | -101.193795 | 1.0 | 0.177620 |
| 16 | Morelos | MS | 18.921850 | -99.235386 | 0.0 | 0.000000 |
| 17 | Nayarit | NT | 21.512231 | -104.894885 | 1.0 | 0.177620 |
| 18 | Nuevo León | NL | 25.671764 | -100.316383 | 51.0 | 9.058615 |
| 19 | Oaxaca | OC | 17.061794 | -96.727163 | 2.0 | 0.355240 |
| 20 | Puebla | PL | 19.042882 | -98.200292 | 6.0 | 1.065719 |
| 21 | Querétaro | QT | 20.379982 | -100.000031 | 11.0 | 1.953819 |
| 22 | Quintana Roo | QR | 18.497805 | -88.302995 | 0.0 | 0.000000 |
| 23 | San Luis Potosí | SP | 22.152165 | -100.976555 | 7.0 | 1.243339 |
| 24 | Sinaloa | SL | 24.808270 | -107.394583 | 6.0 | 1.065719 |
| 25 | Sonora | SR | 29.074873 | -110.959758 | 2.0 | 0.355240 |
| 26 | Tabasco | TC | 17.988263 | -92.920981 | 1.0 | 0.177620 |
| 27 | Tamaulipas | TS | 23.731270 | -99.151769 | 1.0 | 0.177620 |
| 28 | Tlaxcala | TL | 19.317127 | -98.238635 | 0.0 | 0.000000 |
| 29 | Veracruz | VZ | 19.526937 | -96.924010 | 2.0 | 0.355240 |
| 30 | Yucatán | YN | 20.966439 | -89.623114 | 9.0 | 1.598579 |
| 31 | Zacatecas | ZS | 22.775348 | -102.574000 | 0.0 | 0.000000 |
| 32 | Remote/NA | 0 | 0.000000 | 0.000000 | 45.0 | 7.992895 |
For the Folium map, a dictionary was defined to assign a specific color for each type of data job.
# Dictionary to map job colors
job_colors = {
'Data Analyst': 'blue',
'Business Analyst': 'green',
'Data Engineer': 'violet',
'Data Scientist': 'red',
'Data Architect': 'yellow',
'BI Analyst': 'orange',
'ML Engineer': 'black'
}
Later, each vacancy was colored using the dictionary above to map job color.
# Join of the job data dataframe with the Mexican States latitude and longitude
jobs_loc = df.merge(states_df, left_on='Location', right_on='State', how = 'inner')
jobs_loc['Marker Color'] = jobs_loc['Job'].map(job_colors)
jobs_loc.head()
| Original Job Title | Job | Company | Location | Min Salary | Max Salary | Avg Salary | ID | Lat | Long | Marker Color | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Analista de datos | Data Analyst | Skandia Operadora De F... | Ciudad de México | 20000.0 | 25000.0 | 22500.0 | DF | 19.433549 | -99.134405 | blue |
| 1 | Analista de Datos | Data Analyst | General De Seguros | Ciudad de México | NaN | NaN | NaN | DF | 19.433549 | -99.134405 | blue |
| 2 | ANALISTA DE DATOS ECOMMERCE | Data Analyst | Grupo Daisa | Ciudad de México | 16000.0 | 16000.0 | 16000.0 | DF | 19.433549 | -99.134405 | blue |
| 3 | Coordinador analista de datos | Data Analyst | Confidential | Ciudad de México | NaN | NaN | NaN | DF | 19.433549 | -99.134405 | blue |
| 4 | Analista de Datos (Power Bi-Tableau) | Data Analyst | Confidential | Ciudad de México | 29000.0 | 29000.0 | 29000.0 | DF | 19.433549 | -99.134405 | blue |
Then, the choropleth using Folium was created, and the vacancy markers were added.
The JSON file with the geographical data was taken from Arroyo-Velázquez (2022).
# Initialization of Folium map
m = folium.Map(location=[24,-105], zoom_start=5)
# Setting of the Choropleth
folium.Choropleth(
geo_data = 'https://raw.githubusercontent.com/isaacarroyov/data_visualization_practice/master/Python/visualizing_mexican_wildfires_tds/data/states_mx.json',
name = "Data Jobs Localization",
data = demand_by_state_df,
columns = ["ID", "Percentage"],
key_on = "feature.id",
fill_color = "Blues",
fill_opacity = 0.7,
line_opacity = 0.1,
legend_name = "Data Jobs Demand (%)",
).add_to(m)
marker_cluster = MarkerCluster()
m.add_child(marker_cluster)
# Adding of individual markers to clusters
for index, record in jobs_loc.iterrows():
    # Label-based access is clearer and safer than positional slicing of the row
    marker = folium.Marker([record['Lat'], record['Long']],
                           icon=folium.Icon(color='white', icon_color=record['Marker Color']))
marker_cluster.add_child(marker)
# Setting of the Layer Control
folium.LayerControl().add_to(m)
# Showing the map
m
Finally, a choropleth map with Plotly was drawn using the same dataframe.
# Choropleth map with Plotly
fig = px.choropleth(demand_by_state_df,
geojson = 'https://raw.githubusercontent.com/isaacarroyov/data_visualization_practice/master/Python/visualizing_mexican_wildfires_tds/data/states_mx.json',
locations='ID',
color='Percentage',
color_continuous_scale="Blues",
scope="north america",
#title='Demand of Data Jobs per Mexican State',
labels={'Percentage': 'National<br>Demand %'},
height= 500,
width = 800
)
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0}, title_x=0.5, font=font_px)
fig.update_geos(fitbounds="locations", visible=False)
fig.write_image("Figures/Fig2_DemandOfDataJobsPerMexicanState.png", scale=2)
fig.write_image("Figures/Fig2_DemandOfDataJobsPerMexicanState.svg", scale=2)
fig.show(config=config)
In conclusion, the maps above strongly suggest that most of the data jobs are concentrated in the capital, Mexico City, which accounts for more than 50% of the demand at the moment of this study.
On the other hand, according to the data, Nuevo León and Jalisco are distant second places, with only about 9% of the total demand each.
To answer this question, a stacked bar plot and a heatmap were drawn.
First, a dataframe was prepared with the pivot_table function from Pandas, using the location as the index. The vacancies without a specific disclosed location were removed.
The stacked bar plots were drawn using both Matplotlib and Plotly, whereas the heatmaps were drawn using both Seaborn and Plotly.
# Preparing dataframe of jobs per location
jobs_per_location_df = (df.pivot_table(index = 'Location', columns = 'Job', values = 'Company', aggfunc = 'count')
.fillna(0).drop("Remote/NA"))
jobs_per_location_df['Total'] = jobs_per_location_df.sum(axis=1, numeric_only= True)
jobs_per_location_df.head()
| Job | BI Analyst | Business Analyst | Data Analyst | Data Architect | Data Engineer | Data Scientist | ML Engineer | Total |
|---|---|---|---|---|---|---|---|---|
| Location | ||||||||
| Aguascalientes | 0.0 | 2.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 4.0 |
| Baja California | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 4.0 |
| Baja California Sur | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| Chihuahua | 8.0 | 1.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 11.0 |
| Ciudad de México | 37.0 | 54.0 | 101.0 | 18.0 | 50.0 | 58.0 | 2.0 | 320.0 |
As most of the data jobs concentrate in a few locations, the top 5 were used to create a new dataframe for the bar plots.
# Top 5 locations
top_locations = 5
jobs_per_location_bar_plot_df = (jobs_per_location_df.sort_values('Total', ascending = False)[:top_locations]
.sort_values('Total', ascending = True).drop(columns = 'Total')
)
jobs_per_location_bar_plot_df.head()
| Job | BI Analyst | Business Analyst | Data Analyst | Data Architect | Data Engineer | Data Scientist | ML Engineer |
|---|---|---|---|---|---|---|---|
| Location | |||||||
| Querétaro | 2.0 | 3.0 | 5.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| Estado de México | 1.0 | 4.0 | 12.0 | 0.0 | 4.0 | 1.0 | 0.0 |
| Nuevo León | 5.0 | 6.0 | 22.0 | 0.0 | 11.0 | 7.0 | 0.0 |
| Jalisco | 5.0 | 10.0 | 16.0 | 3.0 | 8.0 | 6.0 | 3.0 |
| Ciudad de México | 37.0 | 54.0 | 101.0 | 18.0 | 50.0 | 58.0 | 2.0 |
# Order of the data jobs according to demand
data_jobs_order = list(df['Job'].value_counts().keys())
data_jobs_order
['Data Analyst', 'Business Analyst', 'Data Engineer', 'Data Scientist', 'BI Analyst', 'Data Architect', 'ML Engineer']
# Bar plot of Jobs by State with Matplotlib
# Stack bar chart of Data jobs per State
jobs_per_location_bar_plot_df.plot.barh(stacked=True,
y = data_jobs_order,
color=sns.color_palette('Blues_r',7),
figsize=(8, 5),
width = 0.85 )
plt.legend(facecolor = 'white', loc = 'center right', title = 'Data Job Category')
plt.xlabel('Vacancies', weight = 'bold')
plt.ylabel('Location', weight = 'bold')
plt.title('Demand Per Data Job Category In Top Locations\n')
plt.savefig('Figures/Fig3_DemandPerDataJobCategoryInTopLocations1.png', bbox_inches = 'tight')
plt.show()
# Bar plot of Jobs by State with Plotly
fig = px.bar(jobs_per_location_bar_plot_df,
x=data_jobs_order,
y=jobs_per_location_bar_plot_df.index,
color_discrete_sequence=px.colors.sequential.Blues_r,
height = 500,
width = 800,
title = '<b>Demand Per Data Job Category In Top Locations</b>',
labels = {"value": "<b>Vacancies</b>",
"variable": "Data Job Category",
"Location":"<b>Location</b>"},
barmode='stack'#'group'
)
fig.update_layout(title_x=0.5, font=font_px, legend=legend_px)
fig.write_image("Figures/Fig3_DemandPerDataJobCategoryInTopLocations2.png", scale=2)
fig.write_image("Figures/Fig3_DemandPerDataJobCategoryInTopLocations2.svg", scale=2)
fig.show(config=config)
To draw the heatmaps, a new dataframe was built by joining the previous dataframe with the dataframe containing the states data, so that the heatmap shows all Mexican states and not only those with current vacancies.
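As a minimal sketch of this outer-join step on toy data (the names below are illustrative, not the actual dataframes), an outer join against the full index of states keeps states with no vacancies, which are then filled with zeros:

```python
import pandas as pd

# Toy demand counts for only two states
jobs = pd.DataFrame({'Vacancies': [320, 55]}, index=['CDMX', 'Jalisco'])
# Full list of states (no columns needed, only the index)
states = pd.DataFrame(index=['CDMX', 'Jalisco', 'Zacatecas'])
# Outer join keeps states absent from `jobs`; fill their counts with 0
full = jobs.join(states, how='outer').fillna(0)
print(full)
```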
# Building Dataframe for Heatmap
jobs_per_location_heatmap_df = (jobs_per_location_df.join(states_df, how = 'outer').fillna(0)
.drop(columns = ['Total','ID', 'Lat', 'Long'])
)
jobs_per_location_heatmap_df.head()
| BI Analyst | Business Analyst | Data Analyst | Data Architect | Data Engineer | Data Scientist | ML Engineer | |
|---|---|---|---|---|---|---|---|
| Aguascalientes | 0.0 | 2.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| Baja California | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| Baja California Sur | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Campeche | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Chiapas | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
# Heatmap of Data Jobs per State with Seaborn
plt.figure(figsize = (8,10))
ax = sns.heatmap(jobs_per_location_heatmap_df,
cmap = "Blues_r",
mask = (jobs_per_location_heatmap_df == 0))
ax.set_facecolor('black')
cbar = ax.collections[0].colorbar
cbar.ax.tick_params(labelsize=11)
plt.xlabel('Data Job Category')
plt.ylabel('Location')
plt.xticks(rotation = 330)
plt.title('Demand Per Location & Data Job Category\n')
plt.grid(False)
plt.savefig('Figures/Fig4_DemandPerLocationAndDataJobCategory1.png', bbox_inches = 'tight')
plt.show()
Heatmaps with Plotly require the dataset to be in tidy (long) form, so the data was transformed once again using the melt method from Pandas.
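The wide-to-long conversion can be sketched on toy data (illustrative names only): each (location, job) cell of the wide matrix becomes one row of the tidy frame.

```python
import pandas as pd

# Toy wide matrix: rows are locations, columns are job categories
wide = pd.DataFrame({'Data Analyst': [2, 1], 'Data Engineer': [1, 0]},
                    index=['CDMX', 'Jalisco'])
# Wide -> long: one row per (Location, Job) pair
tidy = (wide.reset_index().rename(columns={'index': 'Location'})
            .melt(id_vars='Location', var_name='Job', value_name='Vacancies'))
print(tidy)
```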
# Converting the Dataset into a tidy format
jobs_per_location_heatmap_tidy_df = pd.melt(jobs_per_location_heatmap_df.reset_index().rename(columns = {'index': 'Location'}),
id_vars= 'Location', var_name = 'Job', value_name = 'Vacancies').\
sort_values(by = 'Location', ascending = False)
jobs_per_location_heatmap_tidy_df.head(10)
| Location | Job | Vacancies | |
|---|---|---|---|
| 223 | Zacatecas | ML Engineer | 0.0 |
| 95 | Zacatecas | Data Analyst | 0.0 |
| 63 | Zacatecas | Business Analyst | 0.0 |
| 159 | Zacatecas | Data Engineer | 0.0 |
| 127 | Zacatecas | Data Architect | 0.0 |
| 191 | Zacatecas | Data Scientist | 0.0 |
| 31 | Zacatecas | BI Analyst | 0.0 |
| 158 | Yucatán | Data Engineer | 1.0 |
| 62 | Yucatán | Business Analyst | 3.0 |
| 126 | Yucatán | Data Architect | 0.0 |
# Heatmap of Data Jobs per State with Plotly
# Heatmap with Plotly
fig = px.density_heatmap(jobs_per_location_heatmap_tidy_df,
y='Location',
x = 'Job',
z = 'Vacancies',
color_continuous_scale= heatmap_px_colorscale,
#color_continuous_scale="Blues_r",
height=1000,
width=800,
title= '<b>Demand Per Location & Data Job Category</b>',
labels={"Job": "<b>Data Job Category</b>",
"Location":"<b>Location</b>"},
)
fig.update_layout(title_x=0.5, coloraxis_colorbar=dict(title="<b>Vacancies</b>"),
font=font_px)
fig.update_xaxes(categoryorder='array', categoryarray= heatmap_px_categoryarray)
fig.write_image("Figures/Fig4_DemandPerLocationAndDataJobCategory2.png", scale=2)
fig.write_image("Figures/Fig4_DemandPerLocationAndDataJobCategory2.svg", scale=2)
fig.show(config=config)
From the plots above, Mexico City, Jalisco, and Nuevo León are clearly the locations where most data jobs are demanded, while the rest of the country lags behind in data job creation.
However, it is important to remark that the demand is largely concentrated in Mexico City, with Jalisco, Nuevo León, and Estado de México as distant second, third, and fourth places, respectively.
Moreover, it is noteworthy that the Data Analyst position is the most demanded across the Mexican states, whereas Data Architect and ML Engineer are the least demanded, being mostly concentrated in Mexico City and Jalisco.
To answer this question, a treemap was drawn using Plotly, in which the size of each square is proportional to the number of vacancies published by a company.
First, a new dataframe was created using the groupby and count methods from Pandas. Then, the vacancies published under the Confidential umbrella were removed, and the dataframe was sliced to keep only the top 15 companies demanding data jobs.
Finally, the company labels were truncated to give the treemap a cleaner, more aesthetic look.
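The preparation steps can be sketched on toy data (hypothetical company names, not the actual dataset): count vacancies per company, sort, then truncate the labels with the vectorized `.str[:16]` accessor.

```python
import pandas as pd

# Toy postings: one row per vacancy
toy = pd.DataFrame({
    'Company': ['Acme Consulting Group SA de CV',
                'Acme Consulting Group SA de CV',
                'Beta'],
    'Job': ['Data Analyst', 'Data Engineer', 'Data Analyst'],
})
counts = (toy.groupby('Company', as_index=False)['Job'].count()
             .rename(columns={'Job': 'Vacancies'})
             .sort_values('Vacancies', ascending=False))
# Truncate long company names to 16 characters for cleaner treemap labels
counts['Company'] = counts['Company'].str[:16]
print(counts)
```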
# Building the Dataframe with the top 15 companies demanding data jobs
top_companies = 15
top_companies_df = (df.groupby(by = 'Company', as_index= False)['Job'].count()
.sort_values(by = 'Job', ascending = False)
.rename(columns = {'Job': 'Vacancies'})[:top_companies]
.set_index('Company').drop('Confidential').reset_index()
)
top_companies_df['Company'] = top_companies_df['Company'].apply(lambda x: x[:16])
top_companies_df.head()
| Company | Vacancies | |
|---|---|---|
| 0 | Bairesdev Llc / | 49 |
| 1 | Banamex | 20 |
| 2 | Pepsico División | 15 |
| 3 | Softtek | 9 |
| 4 | Michael Page | 8 |
# Treemap with Plotly
fig = px.treemap(top_companies_df,
values='Vacancies',
path = [px.Constant("."), 'Company'],
color = 'Vacancies',
color_continuous_scale=px.colors.sequential.Blues,
title= f'<b>Top {top_companies} Companies Demanding Data Jobs</b>',
height= 600,
width = 1000
)
fig.update_layout(title_x=0.5,
coloraxis_colorbar=dict(title="<b>Vacancies</b>"),
font=font_px)
fig.write_image("Figures/Fig5_TopCompaniesDemandingDataJobs.png", scale=2)
fig.write_image("Figures/Fig5_TopCompaniesDemandingDataJobs.svg", scale=2)
fig.show(config=config)
From the plot above, it is clear that Bairesdev, Banamex, Pepsico, and Softtek are the companies with the highest data jobs demand at the moment of this study. So, even though demand may vary over time, the current interest of these companies in data science and analytics may make them attractive prospects for job seekers.
It is also noteworthy that most of the top companies demanding data jobs are recruitment agencies, tech consultancy firms, and banks.
To answer this question, a heatmap was drawn showing which data job categories each company demands; the lighter the color, the larger the demand.
As usual, a dataframe with the data for the top 30 companies demanding data jobs was prepared using the pivot_table method from Pandas.
Of course, the vacancies without a disclosed company were removed from the dataset.
Heatmaps were drawn using both Seaborn and Plotly.
# Dataframe for Heatmap with top 30 companies demanding data jobs
top_companies_data_jobs = 30
jobs_per_company_df = (pd.pivot_table(data = df, index = 'Company', columns = 'Job',
values = 'Location', aggfunc = 'count')
.fillna(0).reset_index())
jobs_per_company_df['Total'] = jobs_per_company_df.sum(axis=1, numeric_only= True)
jobs_per_company_df = (jobs_per_company_df.sort_values('Total', ascending = False)
.set_index('Company').drop('Confidential')[:top_companies_data_jobs]
.drop(columns = 'Total').sort_values('Company', ascending = True)
)
jobs_per_company_df.head()
| Job | BI Analyst | Business Analyst | Data Analyst | Data Architect | Data Engineer | Data Scientist | ML Engineer |
|---|---|---|---|---|---|---|---|
| Company | |||||||
| Adecco | 1.0 | 0.0 | 3.0 | 0.0 | 3.0 | 0.0 | 0.0 |
| Bairesdev Llc / (Tax I... | 2.0 | 19.0 | 12.0 | 2.0 | 8.0 | 4.0 | 2.0 |
| Banamex | 1.0 | 6.0 | 10.0 | 0.0 | 0.0 | 3.0 | 0.0 |
| Bancoppel | 0.0 | 0.0 | 0.0 | 1.0 | 2.0 | 5.0 | 0.0 |
| Banorte | 1.0 | 0.0 | 2.0 | 0.0 | 0.0 | 2.0 | 0.0 |
# Heatmap with Seaborn
plt.figure(figsize = (8,10))
formatter = mpl.ticker.StrMethodFormatter('{x:,.0f}')
ax = sns.heatmap(jobs_per_company_df,
cmap = "Blues_r",
mask = (jobs_per_company_df == 0),
cbar_kws={"format": formatter})
ax.set_facecolor('black')
cbar = ax.collections[0].colorbar
cbar.ax.tick_params(labelsize=11)
plt.xlabel('Data Job Category')
plt.ylabel('Company')
plt.xticks(rotation = 330)
plt.title(f'Demand Per Company (Top {top_companies_data_jobs}) & Data Job Category\n')
plt.grid(False)
plt.savefig('Figures/Fig6_DemandPerCompanyAndDataJobCategory.png', bbox_inches = 'tight')
plt.show()
Likewise, as heatmaps with Plotly require the dataset to be in tidy (long) form, the data was transformed once again using the melt method from Pandas.
# Dataframe for Heatmap with top 30 companies demanding data jobs in tidy format
jobs_per_company_tidy_df = pd.melt(jobs_per_company_df.reset_index().sort_values('Company', ascending = False),
id_vars = 'Company', var_name = 'Job', value_name = 'Vacancies')
jobs_per_company_tidy_df.head()
| Company | Job | Vacancies | |
|---|---|---|---|
| 0 | Softtek | BI Analyst | 1.0 |
| 1 | Servicios Axity Mexico... | BI Analyst | 0.0 |
| 2 | Santander | BI Analyst | 0.0 |
| 3 | Reclutamiento En Tecno... | BI Analyst | 0.0 |
| 4 | Pepsico División Foods | BI Analyst | 1.0 |
# Heatmap with Plotly
fig = px.density_heatmap(jobs_per_company_tidy_df,
y='Company',
x='Job',
z='Vacancies',
color_continuous_scale= heatmap_px_colorscale,
height=1000,
width=900,
title= f'<b>Demand Per Company (Top {top_companies_data_jobs}) & Data Job Category</b>',
labels={'Job': '<b>Data Job Category</b>',
'Company': '<b>Company</b>'},
)
fig.update_layout(title_x=0.5,
coloraxis_colorbar=dict(title="<b>Vacancies</b>"),
font=font_px)
fig.update_xaxes(categoryorder='array', categoryarray=heatmap_px_categoryarray)
fig.write_image("Figures/Fig6_DemandPerCompanyAndDataJobCategory2.png", scale=2)
fig.write_image("Figures/Fig6_DemandPerCompanyAndDataJobCategory2.svg", scale=2)
fig.show(config=config)
From the plots above, it is possible to see that Data Analyst and Data Engineer positions are indeed the most demanded by companies in Mexico at the moment of this study, whereas Data Architect and ML Engineer positions are the least demanded.
Notwithstanding the above, the heatmap suggests that Data Analyst and Data Engineer positions are demanded across a wide range of organizations, while Data Scientist, Data Architect, and especially ML Engineer vacancies are concentrated in more specific organizations such as tech consulting companies and banks.
To answer this question, a heatmap was drawn showing the companies demanding data jobs and their most common locations. Likewise, the lighter the color, the greater the demand.
In this context, two dataframes were prepared: one with all the companies and their locations, and a second one with only the top 30 companies demanding data jobs, in order to draw two corresponding heatmaps.
The dataframe for all companies was prepared using the pivot_table method from Pandas, and the second dataframe was obtained by slicing it. States with no demand were filled with zeros, and the vacancies without a disclosed company were removed from the dataset.
Finally, heatmaps were drawn using both Seaborn and Plotly.
# Base dataframe for analyzing the location of the companies demanding data jobs
companies_by_location_df = (pd.pivot_table(data = df, index = 'Company', columns = 'Location',
values = 'Job', aggfunc = 'count').transpose()
.join(states_df, how = 'outer').drop(columns =['ID', 'Lat', 'Long'])
.transpose().fillna(0).astype(float).drop('Confidential')
.assign(Total=lambda d: d.sum(axis=1)).sort_values('Total', ascending = False)
.drop(columns = 'Total').reset_index().rename(columns = {'index': 'Company'})
.set_index('Company')
)
companies_by_location_df.head()
| Aguascalientes | Baja California | Baja California Sur | Campeche | Chiapas | Chihuahua | Ciudad de México | Coahuila | Colima | Durango | ... | Remote/NA | San Luis Potosí | Sinaloa | Sonora | Tabasco | Tamaulipas | Tlaxcala | Veracruz | Yucatán | Zacatecas | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Company | |||||||||||||||||||||
| Bairesdev Llc / (Tax I... | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 24.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| Banamex | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 20.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Pepsico División Foods | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | ... | 9.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| Softtek | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 | 0.0 | 0.0 | 0.0 | ... | 5.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Capgemini México | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 4.0 | 0.0 | 0.0 | 0.0 | ... | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
5 rows × 33 columns
# Dataframe with all companies and its locations
all_companies_by_location_df = (companies_by_location_df
.sort_index()
.transpose()
)
all_companies_by_location_df.head()
| Company | 5-18 Consulting Group ... | Addon Technologies Inc. | Adecco | Adi Global | Aerovías De México, Sa... | Ait Vanguardia Tecnoló... | Alia | Aliat Universidades | Alten Ingenieria Mexic... | Arca Continental | ... | Universidad Latinoamer... | Universidad Popular Au... | Universidad Tecmilenio | Volaris | Waldo'S Dolar Mart De ... | Wipro Technologies, S.... | Work Cast Solutions Sa... | Workable Ats | Zegovia Rh Sa De Cv | Zillow Group Inc. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Aguascalientes | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Baja California | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Baja California Sur | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Campeche | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Chiapas | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
5 rows × 255 columns
# Top 30 companies demanding data jobs and its locations
top_companies_by_location = 30
top_companies_by_location_df = (companies_by_location_df[:top_companies_by_location]
.sort_index()
)
top_companies_by_location_df.head()
| Aguascalientes | Baja California | Baja California Sur | Campeche | Chiapas | Chihuahua | Ciudad de México | Coahuila | Colima | Durango | ... | Remote/NA | San Luis Potosí | Sinaloa | Sonora | Tabasco | Tamaulipas | Tlaxcala | Veracruz | Yucatán | Zacatecas | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Company | |||||||||||||||||||||
| Adecco | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 6.0 | 0.0 | 0.0 | 0.0 | ... | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Bairesdev Llc / (Tax I... | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 24.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| Banamex | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 20.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Bancoppel | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 5.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 3.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Banorte | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 5.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
5 rows × 33 columns
# Heatmap of the top 30 companies demanding data jobs and their locations in Seaborn
plt.figure(figsize = (12,9))
ax = sns.heatmap(top_companies_by_location_df,
cmap = "Blues_r",
mask = (top_companies_by_location_df == 0))
ax.set_facecolor('xkcd:black')
cbar = ax.collections[0].colorbar
cbar.ax.tick_params(labelsize=12)
plt.xlabel('Location')
plt.ylabel('Company')
plt.title(f'Demand Per Company (Top {top_companies_by_location}) & Location\n')
plt.grid(False)
plt.savefig('Figures/Fig7_DemandPerCompanyAndLocationTop30.png', bbox_inches = 'tight')
plt.show()
Then, to use Plotly, the dataframe was converted into a tidy (or long) format.
# Converting top 30 companies and their locations dataframe into tidy data
top_companies_by_location_tidy_df = pd.melt(top_companies_by_location_df.reset_index()
.sort_values(by = 'Company', ascending = False),
id_vars = 'Company', var_name = 'Location', value_name = 'Vacancies')
top_companies_by_location_tidy_df.head()
| Company | Location | Vacancies | |
|---|---|---|---|
| 0 | Softtek | Aguascalientes | 1.0 |
| 1 | Sociedad Nacional Prom... | Aguascalientes | 0.0 |
| 2 | Servicios Axity Mexico... | Aguascalientes | 0.0 |
| 3 | Santander | Aguascalientes | 0.0 |
| 4 | Reclutamiento En Tecno... | Aguascalientes | 0.0 |
# Heatmap of the top 30 companies demanding data jobs and their locations in Plotly
fig = px.density_heatmap(top_companies_by_location_tidy_df,
y='Company',
x='Location',
z='Vacancies',
color_continuous_scale= heatmap_px_colorscale,
height=900,
width=1100,
title= f'<b>Demand Per Company (Top {top_companies_by_location}) & Location</b>',
labels={'Location': '<b>Location</b>',
'Company': '<b>Company</b>'},
)
fig.update_layout(title_x=0.5,
coloraxis_colorbar=dict(title="<b>Vacancies</b>"),
font=font_px)
fig.update_xaxes(tickangle = 270)
fig.write_image("Figures/Fig7_DemandPerCompanyAndLocationTop30_2.png", scale=2)
fig.write_image("Figures/Fig7_DemandPerCompanyAndLocationTop30.svg", scale=2)
fig.show(config=config)
As expected, most companies are based in Mexico City, since the large majority of the vacancies are offered there. However, the heatmap shows that some organizations, such as Bairesdev or Pepsico, are spread across several Mexican states.
Furthermore, a few well-known companies have their data jobs demand outside the capital region, such as Johnson Controls, which is located in Nuevo León.
Then, the location of all companies demanding data jobs was explored with a second heatmap.
# Heatmap of all companies demanding data jobs and their locations with Seaborn
plt.figure(figsize = (12,9))
ax = sns.heatmap(all_companies_by_location_df,
cmap = "Blues_r",
mask = (all_companies_by_location_df == 0))
ax.set_facecolor('xkcd:black')
cbar = ax.collections[0].colorbar
cbar.ax.tick_params(labelsize=12)
plt.xlabel('Companies Demanding Data Jobs')
plt.ylabel('Location')
plt.xticks([])
plt.title('Demand Per Company & Location\n')
plt.grid(False)
plt.savefig('Figures/Fig8_DemandPerCompanyAndLocation.png', bbox_inches = 'tight')
plt.show()
# Converting all companies and their locations dataframe into tidy data
all_companies_by_location_tidy_df = (all_companies_by_location_df.reset_index()
.rename(columns = {'index': 'Location'})
.set_index('Location')
.transpose().reset_index()
.sort_values('Company', ascending = True)
)
all_companies_by_location_tidy_df = pd.melt(all_companies_by_location_tidy_df,
id_vars = 'Company', var_name= 'Location',
value_name = 'Vacancies').sort_values(by='Location', ascending = False)
all_companies_by_location_tidy_df.head()
| Company | Location | Vacancies | |
|---|---|---|---|
| 8414 | Zillow Group Inc. | Zacatecas | 0.0 |
| 8287 | Inacomb | Zacatecas | 0.0 |
| 8253 | Financiera Sustentable... | Zacatecas | 0.0 |
| 8252 | Financiera Independenc... | Zacatecas | 0.0 |
| 8251 | Femsa | Zacatecas | 0.0 |
# Heatmap of all companies demanding data jobs and their locations with Plotly
fig = px.density_heatmap(all_companies_by_location_tidy_df,
y='Location',
x='Company',
z='Vacancies',
color_continuous_scale= heatmap_px_colorscale,
height=900,
width=1000,
title= '<b>Demand Per Company & Location</b>',
labels={'Location': '<b>Location</b>',
'Company': ''},
)
fig.update_layout(title_x=0.5,
coloraxis_colorbar=dict(title="<b>Vacancies</b>"),
font=font_px,
xaxis={'side': 'top'})
fig.update_xaxes(showticklabels=False)
fig.add_annotation(
xref="paper",
yref="paper",
x=0.5,
y=-0.05,
text='<b>Companies Demanding Data Jobs</b>',
showarrow=False,
font=dict(size=17, color ='dimgray')
)
fig.write_image("Figures/Fig8_DemandPerCompanyAndLocation2.png", scale=2)
fig.write_image("Figures/Fig8_DemandPerCompanyAndLocation2.svg", scale=2)
fig.show(config=config)
Likewise, the heatmap above shows that most of the data jobs are concentrated in Ciudad de México and, to a lesser extent, in Nuevo León and Jalisco.
On the other hand, it is also noteworthy that remote positions are becoming a trend among companies demanding data jobs.
To answer this question, a bar plot was drawn; the corresponding dataset was prepared using the groupby and count methods from Pandas.
It is also important to bear in mind that, from this point on, the second dataset containing only the vacancies with disclosed salary information, salary_df, was used for the analysis.
The bar plots were drawn using both Matplotlib and Plotly.
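As a hedged illustration on toy data of how such a subset can be obtained (the actual salary_df is built earlier in the notebook; column names are assumed to match), vacancies without a disclosed salary can be dropped with dropna:

```python
import numpy as np
import pandas as pd

# Toy vacancies, one of which has no disclosed salary
toy = pd.DataFrame({
    'Job': ['Data Analyst', 'Data Engineer', 'Data Scientist'],
    'Avg Salary': [19_000, np.nan, 38_250],
})
# Keep only the rows where 'Avg Salary' is disclosed
toy_salary_df = toy.dropna(subset=['Avg Salary'])
print(toy_salary_df)
```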
# Count of observations per Data Job Category
obs_per_job_df = (salary_df.groupby('Job')[['Avg Salary']].count()
.rename(columns = {'Avg Salary': 'Observations'})
.sort_values('Observations', ascending = True)
)
obs_per_job_df
| Observations | |
|---|---|
| Job | |
| ML Engineer | 1 |
| Data Architect | 9 |
| Data Scientist | 10 |
| Business Analyst | 23 |
| BI Analyst | 26 |
| Data Engineer | 38 |
| Data Analyst | 90 |
# Bar plot of the salary observations per data job category in Matplotlib
obs_per_job_plt = obs_per_job_df.plot.barh(color = sns.color_palette('Blues_r')[0],
figsize = (7, 6),
alpha = 0.6,
width = 0.9)
for container in obs_per_job_plt.containers:
obs_per_job_plt.bar_label(container, fmt='{:,.0f}', fontweight='bold')
plt.legend([])
plt.xlabel('Observations with Disclosed Salary')
plt.ylabel('Data Job Category')
plt.title('Number of Salary Observations Per Data Job Category\n')
plt.savefig('Figures/Fig9_NumberOfSalaryObservationsPerDataJobCategory.png', bbox_inches = 'tight')
plt.show()
# Bar plot of the salary observations per data job category in Plotly
# Defining bar colors
colors = [px.colors.sequential.Blues[4],]*7
colors[0] = px.colors.sequential.Blues[7]
colors[1] = px.colors.sequential.Blues[7]
colors[2] = px.colors.sequential.Blues[7]
# Bar plot
fig = px.bar(obs_per_job_df.reset_index(),
x='Observations',
y='Job',
color = 'Observations',
color_continuous_scale=px.colors.sequential.Blues,
title= '<b>Number of Salary Observations Per Data Job Category</b>',
labels={'Job': '<b>Data Job Category</b>',
'Observations': '<b>Observations with Disclosed Salary</b>'},
height = 600,
width = 800,
text_auto=True
)
fig.update_layout(title_x=0.5,
font=font_px)
fig.update_traces(marker_color=colors,
marker_line_color='white',
marker_line_width=1,
opacity=0.6)
fig.write_image("Figures/Fig9_NumberOfSalaryObservationsPerDataJobCategory2.png", scale=2)
fig.write_image("Figures/Fig9_NumberOfSalaryObservationsPerDataJobCategory2.svg", scale=2)
fig.show(config=config)
In view of the plots above, the salary results for the Data Scientist, Data Architect, and ML Engineer positions should be interpreted with caution, as fewer than 20 observations were collected for each. However, it is very positive to have collected salary observations for all data job categories.
To answer this question, bar charts and boxplots were drawn based on the salaries for each data job category.
For the bar chart, a dataframe was built using the groupby and mean methods from Pandas.
In contrast, the boxplots were drawn from the original salary_df dataframe in order to show the dispersion of the salary observations for each data job category.
Finally, the bar charts and boxplots were drawn using both Seaborn and Plotly.
# Dataframe with the Avg Monthly Salary per Data Job Category
salary_per_job_df = (salary_df.groupby('Job')[['Avg Salary']].mean()
.reset_index()
.sort_values('Avg Salary', ascending = False)
)
salary_per_job_df
| Job | Avg Salary | |
|---|---|---|
| 6 | ML Engineer | 65000.000000 |
| 3 | Data Architect | 54222.222222 |
| 4 | Data Engineer | 45032.894737 |
| 5 | Data Scientist | 43150.050000 |
| 1 | Business Analyst | 31893.326087 |
| 0 | BI Analyst | 28632.711538 |
| 2 | Data Analyst | 24026.205556 |
# Average Salary per Data Job Category with Matplotlib
palette = [sns.color_palette('Blues_r')[3],]*7
palette[0] = sns.color_palette('Blues_r')[0]
plt.figure(figsize = (7, 6))
ax = sns.barplot(data=salary_per_job_df,
x='Avg Salary',
y='Job',
hue='Job',
palette=palette,
legend=False,
alpha=0.7,
edgecolor='white')
ax.xaxis.set_major_formatter(mpl.ticker.StrMethodFormatter('${x:,.0f}'))
for container in ax.containers:
ax.bar_label(container, fmt='${:,.0f}', fontweight='bold')
plt.legend([])
plt.xlabel('Average Monthly Salary (MXN)')
plt.ylabel('Data Job Category')
plt.grid(axis='both')
plt.title('Average Salary Per Data Job Category\n')
plt.savefig('Figures/Fig10_AverageSalaryPerDataJobCategory.png', bbox_inches = 'tight')
plt.show()
# Average Salary per Data Job Category with Plotly
colors = [px.colors.sequential.Blues[4],]*6
colors.append(px.colors.sequential.Blues[7])
fig = px.bar(salary_per_job_df.sort_values('Avg Salary', ascending=True),
x='Avg Salary',
y='Job',
color='Avg Salary',
color_continuous_scale=px.colors.sequential.Blues,
title='<b>Average Salary Per Data Job Category</b>',
labels={'Job': '<b>Data Job Category</b>',
'Avg Salary': '<b>Average Monthly Salary (MXN)</b>'},
height=600,
width=800,
opacity=0.6,
text_auto='$.3s'
)
fig.update_layout(#yaxis={'categoryorder':'array', 'categoryarray': order},
title_x=0.5,
font=font_px)
fig.update_traces(marker_color=colors,
marker_line_color='white',
marker_line_width=1)
fig.update_xaxes(tickformat='$,~s')
fig.write_image("Figures/Fig10_AverageSalaryPerDataJobCategory2.png", scale=2)
fig.write_image("Figures/Fig10_AverageSalaryPerDataJobCategory2.svg", scale=2)
fig.show(config=config)
From the bar charts above, it is clear that ML Engineer is the data job category with the highest average monthly salary, at about $65,000 MXN.
Furthermore, the second place corresponds to Data Architect positions, with an average monthly salary of about $55,000 MXN.
The rest of the data job positions have the following approximate average monthly salaries:
Data Engineer: $45,000
Data Scientist: $43,000
Business Analyst: $32,000
BI Analyst: $29,000
Data Analyst: $24,000
However, the calculated monthly average salaries might be affected by outliers. So, the monthly median salaries were estimated with the help of the boxplots below.
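The pull a single outlier exerts on the mean, compared with the median, can be seen in a toy example (illustrative numbers, not the collected salaries):

```python
import pandas as pd

# Four typical salaries plus one extreme value
salaries = pd.Series([19_000, 20_000, 21_000, 22_000, 150_000])
print(salaries.mean())    # dragged upward by the outlier
print(salaries.median())  # robust to it
```

This is why the medians below are a safer summary for the sparsely observed categories.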
# Box plot for each Data Job Category with Matplotlib
plt.figure(figsize=(10,6))
ax = sns.boxplot(data=salary_df,
x='Job',
y='Avg Salary',
order = salary_per_job_df.Job,
hue='Job',
palette = sns.color_palette("Blues_r", 7),
legend=False
)
ax.yaxis.set_major_formatter(mpl.ticker.StrMethodFormatter('${x:,.0f}'))
plt.xlabel('Data Job Category')
plt.ylabel('Average Monthly Salary (MXN)')
plt.title('Salary Per Data Job Category\n')
plt.savefig('Figures/Fig11_SalaryPerDataJobCategory.png', bbox_inches = 'tight')
plt.show()
# Box plot for each Data Job Category with Plotly
fig = px.box(salary_df,
x="Job",
y="Avg Salary",
color="Job",
points="all",
color_discrete_sequence=px.colors.sequential.Blues_r,
category_orders={"Job": salary_per_job_df.Job},
labels={"Avg Salary": "<b>Average Monthly Salary (MXN)</b>",
"Job": "<b>Data Job Category</b>"},
title='<b>Salary Per Data Job Category</b>',
height=550,
width=1000
)
fig.update_layout(title_x=0.5,
font=font_px)
fig.update_traces(showlegend=False)
fig.update_yaxes(tickformat = '$,~s')
fig.write_image("Figures/Fig11_SalaryPerDataJobCategory2.png", scale=2)
fig.write_image("Figures/Fig11_SalaryPerDataJobCategory2.svg", scale=2)
fig.show(config=config)
# Calculating median average salaries by data job category
(salary_df.groupby('Job')[['Avg Salary']].median()
.sort_values('Avg Salary', ascending=False)
.rename(columns={"Avg Salary":"Median Salary"}))
| Median Salary | |
|---|---|
| Job | |
| ML Engineer | 65000.0 |
| Data Architect | 57500.0 |
| Data Engineer | 41500.0 |
| Data Scientist | 38250.0 |
| Business Analyst | 31000.0 |
| BI Analyst | 27500.0 |
| Data Analyst | 19000.0 |
The boxplots suggest that, using the median, which is robust to outliers, ML Engineer positions still have the highest salaries in the current Mexican labor market, with a monthly median salary of about $65,000 MXN. However, it is very important to bear in mind that only one salary observation was collected for this data job category, so this result should be taken with caution.
The second highest monthly median salary corresponds to Data Architect positions, at about $57,500 MXN.
On the other hand, Data Engineer and Data Scientist positions have median monthly salaries of about $41,500 MXN and $38,250 MXN, respectively.
Moreover, Business Analyst and BI Analyst positions have median monthly salaries of about $31,000 MXN and $27,500 MXN, respectively.
Finally, Data Analyst positions have the lowest salaries in the data jobs labor market, with a median monthly salary of only about $19,000 MXN.
Please refer to the section 6. Statistical Analysis below of the present notebook for a set of statistical tests applied to the salaries of each data job category.
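As a quick illustration (with toy numbers, not the scraped salaries) of why the median is reported alongside the mean, a single extreme observation drags the mean but leaves the median untouched:

```python
import numpy as np

# Toy monthly salaries (MXN) with one extreme outlier
salaries = np.array([18000, 20000, 22000, 25000, 300000])

mean = salaries.mean()        # pulled far upward by the outlier
median = np.median(salaries)  # unaffected by the single extreme value
```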
# Dataframe with the Mean and Median Monthly Salary per Data Job Category
avg_salary_per_job_df = (salary_df.groupby('Job')[['Avg Salary']].aggregate(['mean', 'median'])
.droplevel(level=0, axis=1)
                                         .rename(columns={'mean':'Mean Salary', 'median':'Median Salary'})
                                         .reset_index()
)
avg_salary_per_job_df = pd.melt(avg_salary_per_job_df, id_vars='Job', var_name='Measure', value_name='Value')
avg_salary_per_job_df
| Job | Measure | Value | |
|---|---|---|---|
| 0 | BI Analyst | Mean Salary | 28632.711538 |
| 1 | Business Analyst | Mean Salary | 31893.326087 |
| 2 | Data Analyst | Mean Salary | 24026.205556 |
| 3 | Data Architect | Mean Salary | 54222.222222 |
| 4 | Data Engineer | Mean Salary | 45032.894737 |
| 5 | Data Scientist | Mean Salary | 43150.050000 |
| 6 | ML Engineer | Mean Salary | 65000.000000 |
| 7 | BI Analyst | Median Salary | 27500.000000 |
| 8 | Business Analyst | Median Salary | 31000.000000 |
| 9 | Data Analyst | Median Salary | 19000.000000 |
| 10 | Data Architect | Median Salary | 57500.000000 |
| 11 | Data Engineer | Median Salary | 41500.000000 |
| 12 | Data Scientist | Median Salary | 38250.000000 |
| 13 | ML Engineer | Median Salary | 65000.000000 |
# Mean and Median Salary per Data Job Category with Matplotlib
palette = [sns.color_palette('Blues_r')[0],sns.color_palette('Blues_r')[3]]
plt.figure(figsize = (7, 6))
ax = sns.barplot(data=avg_salary_per_job_df.sort_values('Value', ascending=False),
x='Value',
y='Job',
hue='Measure',
palette=palette,
legend=True,
alpha=0.7,
edgecolor='white')
ax.xaxis.set_major_formatter(mpl.ticker.StrMethodFormatter('${x:,.0f}'))
for container in ax.containers:
ax.bar_label(container, fmt='${:,.0f}', fontweight='bold')
plt.legend(facecolor='white')
plt.xlabel('Monthly Salary (MXN)')
plt.ylabel('Data Job Category')
plt.grid(True, axis='both')
plt.title('Mean & Median Salary Per Data Job Category\n')
plt.savefig('Figures/Fig12_MeanMedianSalaryPerDataJob.png', bbox_inches = 'tight')
plt.show()
avg_salary_per_job_df.sort_values('Value', ascending=True).query("Measure == 'Mean Salary'")['Job']
2        Data Analyst
0          BI Analyst
1    Business Analyst
5      Data Scientist
4       Data Engineer
3      Data Architect
6         ML Engineer
Name: Job, dtype: object
y = avg_salary_per_job_df.sort_values('Value', ascending=True).query("Measure == 'Mean Salary'")['Job']
x_mean = avg_salary_per_job_df.sort_values('Value', ascending=True).query("Measure == 'Mean Salary'")['Value']
x_median = avg_salary_per_job_df.sort_values('Value', ascending=True).query("Measure == 'Median Salary'")['Value']
opacity=0.7
fig = go.Figure()
fig.add_trace(go.Bar(y=y,
x=x_mean,
name='Mean Salary',
marker_color=px.colors.sequential.Blues[7],
opacity=opacity,
text=['$'+str(round(i/1000,1))+'k' for i in x_mean],
orientation='h'
))
fig.add_trace(go.Bar(y=y,
x=x_median,
name='Median Salary',
marker_color=px.colors.sequential.Blues[4],
opacity=opacity,
text=['$'+str(round(i/1000,1))+'k' for i in x_median],
orientation='h'
))
fig.update_layout(title='<b>Mean & Median Salary Per Data Job Category</b>',
title_x=0.5,
yaxis=dict(title='<b>Data Job Category</b>'),
xaxis=dict(title='<b>Monthly Salary (MXN)</b>'),
height=600,
width=800,
barmode='group',
font=font_px,
legend=legend_px,
)
fig.update_traces(
marker_line_color='white',
marker_line_width=1,
)
fig.update_xaxes(tickformat='$,~s')
fig.write_image("Figures/Fig12_MeanMedianSalaryPerDataJob2.png", scale=2)
fig.write_image("Figures/Fig12_MeanMedianSalaryPerDataJob2.svg", scale=2)
fig.show(config=config)
To answer this question, a heatmap was drawn to show the relationship among location, data job category, and average salary; the lighter the color, the higher the salary.
To do so, the appropriate dataframe was built using the pivot_table and join methods from Pandas. The pivoted salary table by location was joined with the states dataframe in order to display all the states of Mexico, not only those with vacancies. Moreover, vacancies with undisclosed locations were removed.
Finally, heatmaps were drawn using both Seaborn and Plotly.
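The pivot-and-join step described above can be sketched on a toy dataframe (made-up values and state names, not the scraped data):

```python
import pandas as pd

# Toy vacancies table (made-up values) with the same columns used below
df = pd.DataFrame({
    'Location': ['CDMX', 'CDMX', 'Jalisco'],
    'Job': ['Data Analyst', 'Data Engineer', 'Data Analyst'],
    'Avg Salary': [20000.0, 45000.0, 24000.0],
})
# Toy catalogue of states, including one with no vacancies at all
states = pd.DataFrame(index=['CDMX', 'Jalisco', 'Sonora'])

heat = (pd.pivot_table(df, index='Location', columns='Job',
                       values='Avg Salary', aggfunc='mean')
        .join(states, how='outer')  # outer join keeps states with no vacancies
        .fillna(0)
        .sort_index())
```

The outer join is what makes vacancy-less states appear as all-zero rows in the heatmap.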
# Dataframe with average salaries per location and data job category
salary_location_df = (pd.pivot_table(data = salary_df,
index = 'Location',
columns = 'Job',
values = 'Avg Salary',
aggfunc= 'mean')
.join(states_df, how = 'outer').drop(columns =['ID', 'Lat', 'Long'])
.fillna(0).sort_index(ascending= True)
)
salary_location_df.head()
| BI Analyst | Business Analyst | Data Analyst | Data Architect | Data Engineer | Data Scientist | ML Engineer | |
|---|---|---|---|---|---|---|---|
| Aguascalientes | 0.0 | 0.0 | 18500.0 | 0.0 | 11000.0 | 0.0 | 0.0 |
| Baja California | 31000.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Baja California Sur | 0.0 | 0.0 | 10250.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Campeche | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Chiapas | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
# Heatmap of the Salaries per Data Job category and Location with Matplotlib
plt.figure(figsize = (8,10))
formatter = mpl.ticker.StrMethodFormatter('${x:,.0f}')
ax = sns.heatmap(salary_location_df,
cmap = "Blues_r",
mask = (salary_location_df == 0),
cbar_kws={"format": formatter})
ax.set_facecolor('black')
cbar = ax.collections[0].colorbar
cbar.ax.tick_params(labelsize=11)
plt.xlabel('Data Job Category')
plt.ylabel('Location')
plt.title(f'Salary Per Location And Data Job Category\n')
plt.xticks(rotation = 330)
plt.grid(False)
plt.savefig('Figures/Fig13_SalaryPerLocationAndDataJobCategory.png', bbox_inches = 'tight')
plt.show()
# Transforming dataframe into tidy format
salary_location_tidy_df = salary_location_df.rename_axis('Location').reset_index()
salary_location_tidy_df = pd.melt(salary_location_tidy_df, id_vars= 'Location', var_name = 'Job', value_name = 'Salary')
salary_location_tidy_df = salary_location_tidy_df.sort_values('Location', ascending = False)
salary_location_tidy_df.head()
| Location | Job | Salary | |
|---|---|---|---|
| 230 | Zacatecas | ML Engineer | 0.0 |
| 32 | Zacatecas | BI Analyst | 0.0 |
| 197 | Zacatecas | Data Scientist | 0.0 |
| 98 | Zacatecas | Data Analyst | 0.0 |
| 131 | Zacatecas | Data Architect | 0.0 |
# Heatmap of the Salaries per Data Job category and Location with Plotly
fig = px.density_heatmap(salary_location_tidy_df,
y='Location',
x = 'Job',
z = 'Salary',
histfunc="avg",
color_continuous_scale=heatmap_px_colorscale,
#color_continuous_scale="Blues",
height=1000,
width=900,
title='<b>Salary Per Location And Data Job Category</b>',
labels={
'Job': '<b>Data Job Category</b>',
'Location':'<b>Location</b>'
}
)
fig.update_layout(title_x=0.5,
font=font_px,
coloraxis_colorbar=dict(title="<b>Monthly Salary <br>(MXN)</b>"))
fig.update_traces(colorbar_title_text='Monthly Salary (MXN)')
fig.update_coloraxes(colorbar_tickformat = '$,~s')
fig.update_xaxes(categoryorder='array', categoryarray=heatmap_px_categoryarray)
fig.write_image("Figures/Fig13_SalaryPerLocationAndDataJobCategory2.png", scale=2)
fig.write_image("Figures/Fig13_SalaryPerLocationAndDataJobCategory2.svg", scale=2)
fig.show(config=config)
In general, the plot above suggests that the highest salaries can be found in Remote positions, Mexico City, Jalisco, Estado de México, and San Luis Potosí. However, the observation for the latter state is atypical and should be interpreted with caution.
More specifically, the highest salaries for the different data jobs categories can be found in the following states:
| Data Job Category | Locations with the Highest Average Salaries |
|---|---|
| BI Analyst | Remote |
| Business Analyst | Jalisco |
| Data Analyst | Remote |
| Data Architect | Remote |
| Data Engineer | Jalisco |
| Data Scientist | Ciudad de México |
| ML Engineer | Jalisco |
To answer this question, a lollipop chart was drawn using the top average salaries per company; the longer the lollipop, the higher the salary offered by a given company.
The dataset was built using the groupby method from Pandas.
Finally, the charts were drawn using both Matplotlib and Plotly.
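The company aggregation described above can be sketched on toy data (hypothetical companies and salaries):

```python
import pandas as pd

# Toy vacancies (hypothetical companies and salaries)
df = pd.DataFrame({
    'Company': ['A', 'A', 'B', 'C'],
    'Avg Salary': [30000.0, 50000.0, 70000.0, 20000.0],
})

top = (df.groupby('Company')[['Avg Salary']].mean()  # mean salary per company
         .sort_values('Avg Salary', ascending=False)
         .head(2))                                   # keep the two best payers
```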
# Dataframe with the top average salaries per company
top_salary_company = 20
salary_company_df = (salary_df.groupby('Company')[['Avg Salary']].mean()
.drop('Confidential')
.sort_values('Avg Salary', ascending = False)[:top_salary_company]
)
salary_company_df.head()
| Avg Salary | |
|---|---|
| Company | |
| Ecosistemex S. De R.L.... | 90000.000000 |
| Caspex Corp | 77500.000000 |
| Addon Technologies Inc. | 77500.000000 |
| Enterprise Solutions, ... | 71666.666667 |
| Softtek | 70000.000000 |
# Companies paying the highest salaries with Matplotlib
color = sns.color_palette('Blues_r')[0]
plt.figure(figsize = (8,7))
ax = sns.scatterplot(data = salary_company_df,
x = 'Avg Salary',
y = 'Company',
color = color
)
ax.xaxis.set_major_formatter(mpl.ticker.StrMethodFormatter('${x:,.0f}'))
for index, value in salary_company_df.sort_values('Avg Salary')['Avg Salary'].items():
plt.hlines(xmax = value, xmin = np.min(salary_company_df['Avg Salary']), y = index, color=color) # Stems
plt.xlabel('Average Monthly Salary (MXN)')
plt.ylabel(f'Company')
plt.xticks(rotation = 0)
plt.title(f'Top {top_salary_company} Companies Paying The Highest Salaries\n')
plt.savefig('Figures/Fig14_Top20SalaryPerCompany.png', bbox_inches = 'tight')
plt.show()
# Companies paying the highest salaries with Plotly
color=px.colors.sequential.Blues[7]
fig = px.scatter(salary_company_df.reset_index().sort_values('Avg Salary'),
x='Avg Salary',
y='Company',
height = 700,
width = 800,
title = f'<b>Top {top_salary_company} Companies Paying The Highest Salaries</b>',
labels = {'Avg Salary':'<b>Average Monthly Salary (MXN)</b>',
'Company':'<b>Company</b>'},
)
fig.update_layout(title_x=0.5, font=font_px)
fig.update_traces(marker_color=color, marker_line_color=color)
fig.update_xaxes(tickformat = '$,~s')
for index, value in salary_company_df.sort_values('Avg Salary')['Avg Salary'].items():
fig.add_shape(type='line', y0=index, y1= index, x0=np.min(salary_company_df['Avg Salary']), x1= value, xref='x',
yref='y', line=dict(color= px.colors.sequential.Blues[7]))
fig.write_image("Figures/Fig14_Top20SalaryPerCompany2.png", scale=2)
fig.write_image("Figures/Fig14_Top20SalaryPerCompany2.svg", scale=2)
fig.show(config=config)
From the plots above, the companies offering the highest salaries are Ecosistemex, Caspex, Addon Technologies, Enterprise Solutions, and Softtek. Of those, two are recruiting agencies and three are tech consulting firms.
To answer this question, a heatmap was drawn to show the relationship among company, data job category, and average salary; the lighter the color, the higher the salary.
To do so, the appropriate dataframe was built using the pivot_table method from Pandas. Moreover, vacancies with undisclosed employers were removed.
Finally, heatmaps were drawn using both Seaborn and Plotly.
# Dataframe with average salaries per company and data job category
top_salary_company_per_data_job = 30
salary_company_per_data_job_df = (pd.pivot_table(data = salary_df,
index = 'Company',
columns = 'Job',
values = 'Avg Salary',
aggfunc= 'mean')
.fillna(0).sort_index(ascending= True)
.drop('Confidential')
.assign(Average= lambda x: x[x > 0].mean(axis=1, numeric_only= True))
.rename(columns={'Average':'Total Average'})
.sort_values('Total Average', ascending = False)[:top_salary_company_per_data_job]
.sort_index(ascending= True)
)
salary_company_per_data_job_df.head()
| Job | BI Analyst | Business Analyst | Data Analyst | Data Architect | Data Engineer | Data Scientist | ML Engineer | Total Average |
|---|---|---|---|---|---|---|---|---|
| Company | ||||||||
| Addon Technologies Inc. | 0.0 | 0.0 | 0.0 | 0.0 | 77500.0 | 0.0 | 0.0 | 77500.0 |
| Ait Vanguardia Tecnoló... | 0.0 | 0.0 | 0.0 | 0.0 | 43500.0 | 0.0 | 0.0 | 43500.0 |
| Alia | 0.0 | 0.0 | 62500.0 | 0.0 | 0.0 | 0.0 | 0.0 | 62500.0 |
| Caspex Corp | 0.0 | 0.0 | 77500.0 | 0.0 | 0.0 | 0.0 | 0.0 | 77500.0 |
| Cognizant Technology S... | 0.0 | 0.0 | 0.0 | 0.0 | 65000.0 | 0.0 | 0.0 | 65000.0 |
# Heatmap of the Companies Offering the Highest Salaries Per Data Job Category with Matplotlib
plt.figure(figsize = (8,9))
formatter = mpl.ticker.StrMethodFormatter('${x:,.0f}')
ax = sns.heatmap(data = salary_company_per_data_job_df.drop(columns = 'Total Average'),
cmap = 'Blues_r',
mask = (salary_company_per_data_job_df.drop(columns = 'Total Average') == 0),
cbar_kws={"format": formatter}
)
ax.set_facecolor('black')
cbar = ax.collections[0].colorbar
cbar.ax.tick_params(labelsize=11)
plt.xlabel('Data Job Category')
plt.ylabel('Company')
plt.title(f'Salary Per Company (Top {top_salary_company_per_data_job}) And Data Job Category\n')
plt.xticks(rotation = 330)
plt.grid(False)
plt.savefig('Figures/Fig15_SalaryPerCompanyAndDataJobCategory.png', bbox_inches = 'tight')
plt.show()
# Converting dataframe into a tidy format
salary_company_per_data_job_tidy_df = (salary_company_per_data_job_df
.copy()
.drop(columns = 'Total Average')
.reset_index()
)
salary_company_per_data_job_tidy_df = pd.melt(salary_company_per_data_job_tidy_df,
id_vars = 'Company',
var_name = 'Job',
value_name = 'Salary')
salary_company_per_data_job_tidy_df.head()
| Company | Job | Salary | |
|---|---|---|---|
| 0 | Addon Technologies Inc. | BI Analyst | 0.0 |
| 1 | Ait Vanguardia Tecnoló... | BI Analyst | 0.0 |
| 2 | Alia | BI Analyst | 0.0 |
| 3 | Caspex Corp | BI Analyst | 0.0 |
| 4 | Cognizant Technology S... | BI Analyst | 0.0 |
# Heatmap of the Companies Offering the Highest Salaries Per Data Job Category with Plotly
fig = px.density_heatmap(salary_company_per_data_job_tidy_df.sort_values(by = 'Company', ascending = False),
y='Company',
x = 'Job',
z = 'Salary',
histfunc="avg",
color_continuous_scale=heatmap_px_colorscale,
height=1000,
width=900,
title= f'<b>Salary Per Company (Top {top_salary_company_per_data_job}) And Data Job Category</b>',
labels={'Job': '<b>Data Job Category</b>',
'Company':'<b>Company</b>'}
)
fig.update_layout(title_x=0.5,
coloraxis_colorbar=dict(title="<b>Monthly Salary<br>(MXN)</b>"),
font=font_px)
fig.update_xaxes(categoryorder='array', categoryarray= heatmap_px_categoryarray)
fig.update_coloraxes(colorbar_tickformat = '$,~s')
fig.write_image("Figures/Fig15_SalaryPerCompanyAndDataJobCategory2.png", scale=2)
fig.write_image("Figures/Fig15_SalaryPerCompanyAndDataJobCategory2.svg", scale=2)
fig.show(config=config)
So, for BI Analyst positions, the companies offering the highest salaries are Randstad and Ids Comercial.
For Business Analyst positions, the organizations offering higher salaries are Manpower and Totaltech.
Moreover, for Data Analyst positions, the organizations offering higher salaries are Caspex and Santander.
Furthermore, for Data Architect positions, the organizations offering higher salaries are Softtek and Everis.
In addition, for Data Engineer positions, the organizations offering higher salaries are Manpower, and Addon Technologies.
For Data Scientist positions, the organizations offering the highest salaries are Ecosistemex and Enterprise Solutions.
Finally, for ML Engineer positions, the only company with a disclosed salary is Enterprise Solutions.
To answer this question, a bar chart was drawn showing the vacancies with a salary at or above the 85th percentile. The original job titles as published on the job website were used for the sake of descriptiveness.
To do so, the percentile was calculated with the percentile function from Numpy. Then the salary dataframe was filtered to select only those vacancies at or above the 85th percentile.
Finally, a bar chart was drawn using both Seaborn and Plotly.
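The percentile-filtering step can be sketched on toy data (illustrative salaries only):

```python
import numpy as np
import pandas as pd

# Toy average salaries (MXN/month); values are illustrative only
avg_salaries = pd.Series([15000, 20000, 25000, 30000, 35000, 40000, 60000, 90000])

cutoff = np.percentile(avg_salaries.values, 85)  # 85th-percentile threshold
top = avg_salaries[avg_salaries >= cutoff]       # vacancies at or above it
```

Note that `np.percentile` interpolates linearly between order statistics by default, so the cutoff need not equal an observed salary.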
# Percentile calculation
percentile = np.percentile(salary_df['Avg Salary'].values, 85)
percentile
57500.0
# Dataframe with vacancies whose salary is at or above the 85th percentile
top_vacancies = (salary_df[salary_df['Avg Salary'] >= percentile]
.drop(columns=['Min Salary', 'Max Salary'])
.sort_values('Avg Salary', ascending=False)
.reset_index(drop=True)
.assign(Salary_k= lambda d: d['Avg Salary']/1000)
)
top_vacancies
| Original Job Title | Job | Company | Location | Avg Salary | Salary_k | |
|---|---|---|---|---|---|---|
| 0 | Científico de Datos | Data Scientist | Ecosistemex S. De R.L.... | Ciudad de México | 90000.0 | 90.0 |
| 1 | Data Engineer (Databricks) | Data Engineer | Manpower, S.A. De C.V. | Estado de México | 85000.0 | 85.0 |
| 2 | Data engineer (Databricks) | Data Engineer | Manpower, S.A. De C.V. | Estado de México | 85000.0 | 85.0 |
| 3 | Analista de datos Sr. | Data Analyst | Confidential | Ciudad de México | 80000.0 | 80.0 |
| 4 | Sr Data Engineer | Data Engineer | Addon Technologies Inc. | Jalisco | 77500.0 | 77.5 |
| 5 | SQL Server Data Analyst | Data Analyst | Caspex Corp | Remote/NA | 77500.0 | 77.5 |
| 6 | Data Engineer with Fivetran developer | Data Engineer | Enterprise Solutions, ... | Remote/NA | 75000.0 | 75.0 |
| 7 | Data Architect | Data Architect | Softtek | Remote/NA | 75000.0 | 75.0 |
| 8 | Sr Data Scientist | Data Scientist | Enterprise Solutions, ... | Ciudad de México | 75000.0 | 75.0 |
| 9 | Analista de datos sr. | Data Analyst | Confidential | Ciudad de México | 75000.0 | 75.0 |
| 10 | Business Intelligence Program Manager, Advance... | BI Analyst | Randstad | Ciudad de México | 70000.0 | 70.0 |
| 11 | Arquitecto de Datos Azure | Data Architect | Monterrey It Cluster | Ciudad de México | 67500.0 | 67.5 |
| 12 | Sr. Business Systems Analyst | Business Analyst | Confidential | Jalisco | 67500.0 | 67.5 |
| 13 | Arquitecto cloud Datos | Data Architect | Everis | Remote/NA | 67500.0 | 67.5 |
| 14 | Senior Data Engineer (Qlik) | Data Engineer | Confidential | Jalisco | 67500.0 | 67.5 |
| 15 | Arquitecto de Datos Azure | Data Architect | Monterrey It Cluster | Ciudad de México | 65000.0 | 65.0 |
| 16 | Data engineer/SCALA | Data Engineer | Cognizant Technology S... | Jalisco | 65000.0 | 65.0 |
| 17 | Sr. Data Engineer | Data Engineer | Inetum Holding Mexico ... | Remote/NA | 65000.0 | 65.0 |
| 18 | Azure Machine Learning – Technical Support Eng... | ML Engineer | Enterprise Solutions, ... | Jalisco | 65000.0 | 65.0 |
| 19 | Data Engineer Azure | Data Engineer | Softtek | Remote/NA | 65000.0 | 65.0 |
| 20 | Data Engineer | Data Engineer | Work Cast Solutions Sa... | Remote/NA | 65000.0 | 65.0 |
| 21 | Data Analyst | Data Analyst | Santander | Ciudad de México | 65000.0 | 65.0 |
| 22 | Ingeniero de datos / Desarrollador / Hibrido | Data Engineer | Ids Comercial | Ciudad de México | 64000.0 | 64.0 |
| 23 | Data Analyst | Data Analyst | Alia | Ciudad de México | 62500.0 | 62.5 |
| 24 | Oracle Business Analyst I | Business Analyst | Manpower, S.A. De C.V. | San Luis Potosí | 62500.0 | 62.5 |
| 25 | Data Integration Engineer (GCP) | Data Engineer | Kode It S.C. | Ciudad de México | 60500.0 | 60.5 |
| 26 | INGENIERO SR BASE DE DATOS | Data Engineer | Grupo Industrial Miner... | Ciudad de México | 60000.0 | 60.0 |
| 27 | Big Data Engineer Sr - Inglés | Data Engineer | Servicios Axity Mexico... | Ciudad de México | 57500.0 | 57.5 |
| 28 | Arquitecto de Gobierno de datos y empresarial | Data Architect | Confidential | Remote/NA | 57500.0 | 57.5 |
| 29 | Gerente Analista de Datos / Power BI | Data Analyst | Human Staff | Nuevo León | 57500.0 | 57.5 |
| 30 | Analista de Datos en Azure (6 meses) | Data Analyst | Sygno | Nuevo León | 57500.0 | 57.5 |
# Bar chart of the vacancies with the highest salaries in Matplotlib
fig, ax = plt.subplots(figsize = (5,9))
sns.barplot(data=top_vacancies,
x='Salary_k',
y=top_vacancies.index,
hue='Location',
palette='viridis',
legend=True,
alpha=0.7,
dodge=False,
edgecolor='white',
errorbar=None,
ax = ax,
orient='h')
ax.xaxis.set_major_formatter(mpl.ticker.StrMethodFormatter('${x:,.0f}k'))
ax.set_yticks(top_vacancies.index, labels=top_vacancies['Original Job Title'])  # fixed locator avoids the FixedFormatter warning
ax.set_xlim([min(top_vacancies['Salary_k'])-5, max(top_vacancies['Salary_k'])])
ax2 = ax.secondary_yaxis('right')
ytickslocs = ax.get_yticks()
ax2.set_yticks(ytickslocs, labels=top_vacancies.Company)
ax2.set_ylabel('\nCompany')
plt.legend(facecolor='white')
plt.xlabel('\nAverage Monthly Salary (MXN)')
plt.ylabel('Vacancy')
plt.grid(True, axis='both')
plt.title('Vacancies with the Highest Salaries\n')
plt.savefig('Figures/Fig16_VacanciesHighestSalaries.png', bbox_inches = 'tight')
plt.show()
# Bar chart of the vacancies with the highest salaries in Plotly
top_vacancies_plotly = top_vacancies.sort_values('Avg Salary', ascending=True).reset_index(drop=True)
subfig = make_subplots(specs=[[{"secondary_y": True}]],
)
fig = px.bar(top_vacancies_plotly,
x='Avg Salary',
y=top_vacancies_plotly.index,
color='Location',
color_discrete_sequence=px.colors.sequential.Viridis,
orientation='h',
)
fig2 = px.line(top_vacancies_plotly,
x=top_vacancies_plotly.index,
y=top_vacancies_plotly.index,
orientation='h'
)
fig2.update_traces(yaxis="y2")
subfig.add_traces(fig.data + fig2.data)
subfig.update_xaxes(tickformat = '$,~s',
range=[min(top_vacancies_plotly['Avg Salary'])-10000, max(top_vacancies_plotly['Avg Salary'])])
subfig.update_layout(
title='<b>Vacancies with the Highest Salaries</b>',
xaxis=dict(title='<b>Average Monthly Salary (MXN)</b>'),
title_x=0.5,
font=font_px,
legend=dict(
bgcolor='rgba(255,255,255,0.6)',
bordercolor='#cbcccd',
borderwidth=1.5,
yanchor="bottom",
y=0.04,
xanchor="right",
x=0.90,
itemwidth=30,
),
yaxis = dict(
title='<b>Vacancy</b>',
tickvals=top_vacancies_plotly.index,
ticktext=top_vacancies_plotly['Original Job Title'],
showgrid=True
),
yaxis2 = dict(
title='<b>Company</b>',
tickvals=top_vacancies_plotly.index,
ticktext=top_vacancies_plotly['Company'],
showgrid=False,
scaleanchor="y",
scaleratio=1,
),
height=900,
width=1000,
barmode='stack',
bargap=0.15,
)
subfig.update_traces(
marker_line_color='white',
marker_line_width=1,
opacity=0.7,
)
subfig.write_image("Figures/Fig16_VacanciesHighestSalaries2.png", scale=2)
subfig.write_image("Figures/Fig16_VacanciesHighestSalaries2.svg", scale=2)
subfig.show(config=config)
As can be seen from the plots above, the vacancies at or above the 85th percentile have an average monthly compensation ranging from $57,500 MXN to $90,000 MXN.
Indeed, the top-paying vacancy at the time of this study was a Data Scientist position at Ecosistemex, located in Mexico City.
As expected, most of those vacancies correspond to senior or managerial positions. However, it is noteworthy that most data job categories are represented in the 85th percentile, with several senior Data Analyst, Data Engineer, Data Scientist, and BI positions among the top vacancies.
It is remarkable that the vacancies with the highest salaries are located in Mexico City, Estado de México, Jalisco, San Luis Potosí, and Nuevo León, as well as in remote positions.
In this section, a statistical analysis was performed to determine whether there was any statistically significant difference among the mean salaries of the data job categories, and to identify for which categories the difference was significant.
To do so, the D'Agostino-Pearson normality test was first used to assess the normality of the data jobs salary distribution. Then, regardless of the results of the normality test, a set of both parametric and non-parametric tests was carried out.
In this sense, the mean salaries of the data job categories were compared through an ANOVA and a Kruskal-Wallis test.
After that, a post hoc analysis was performed by means of the Tukey-Kramer and Dunn's tests in order to identify which salary differences were, indeed, significant.
Moreover, to confirm the significance of the difference in mean salaries for specific pairs of data jobs, the one-sample t-test, the two-sample t-test with unequal variance, and the Mann-Whitney U test were also performed.
Finally, an effect size analysis was carried out by computing the absolute mean salary difference, Cohen's d, and bootstrap confidence intervals for each data job category, in order to assess whether the salary differences are significant not only from a statistical but also from a practical point of view.
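The effect-size computations mentioned above (Cohen's d and a bootstrap confidence interval for the mean difference) can be sketched on synthetic samples; `cohen_d` is a hypothetical helper written here for illustration, not part of the notebook:

```python
import numpy as np

rng = np.random.default_rng(42)

def cohen_d(a, b):
    """Cohen's d with pooled standard deviation (hypothetical helper)."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * np.var(a, ddof=1) + (nb - 1) * np.var(b, ddof=1)) / (na + nb - 2)
    return (np.mean(a) - np.mean(b)) / np.sqrt(pooled)

# Synthetic stand-ins for two data-job salary groups (values are made up)
a = rng.normal(55000, 8000, 30)
b = rng.normal(40000, 8000, 30)

d = cohen_d(a, b)

# Percentile bootstrap 95% CI for the difference in means
diffs = [np.mean(rng.choice(a, len(a))) - np.mean(rng.choice(b, len(b)))
         for _ in range(2000)]
lo, hi = np.percentile(diffs, [2.5, 97.5])
```

A confidence interval that excludes zero, together with a large d, indicates the difference is practically and not merely statistically meaningful.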
To answer this question, a histogram was drawn with the retrieved salary data using Seaborn and Plotly.
# Histogram with Seaborn
bins = 8
minh = np.min(salary_df['Avg Salary'])
maxh = np.max(salary_df['Avg Salary'])
ticks = np.linspace(minh,maxh,bins + 1)
plt.figure(figsize = (10, 6))
ax = sns.histplot(data = salary_df,
x = 'Avg Salary',
color = sns.color_palette('Blues_r')[0],
bins = bins,
alpha = 0.6
)
ax.xaxis.set_major_formatter(mpl.ticker.StrMethodFormatter('${x:,.0f}'))
plt.xlabel('Average Monthly Salary (MXN)')
plt.ylabel('Frequency')
plt.xticks(ticks)
plt.title('Data Jobs Salary Distribution\n')
plt.grid(True, axis='both')
plt.savefig('Figures/Fig17_DataJobsSalaryDistribution.png', bbox_inches = 'tight')
plt.show()
# Histogram with Plotly
size = ( maxh - minh )/bins
fig = px.histogram(salary_df, x = 'Avg Salary',
nbins= bins,
opacity=0.5,
color_discrete_sequence = [px.colors.sequential.Blues_r[1]],
title= '<b>Data Jobs Salary Distribution</b>',
labels={'Avg Salary': '<b>Average Monthly Salary (MXN)</b>'},
height = 600,
width = 800
)
fig.update_layout(yaxis_title_text='<b>Frequency</b>', bargap=0.01, title_x=0.5,
font=font_px)
fig.update_traces(xbins=dict( # bins used for histogram
start=minh,
end=maxh,
size=size
))
fig.update_xaxes(tickvals=list(ticks), tickformat = '$,~s')
fig.write_image("Figures/Fig17_DataJobsSalaryDistribution2.png", scale=2)
fig.write_image("Figures/Fig17_DataJobsSalaryDistribution2.svg", scale=2)
fig.show(config=config)
Most of the data jobs salaries fall in the range from $14,750 MXN to $25,500 MXN per month, whereas monthly salaries above $68,500 MXN are scarcer.
On the other hand, it is noteworthy that the salary distribution is skewed to the right, so it might not comply with the normality assumption.
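A minimal sketch of quantifying that right-skew with `scipy.stats.skew`, on synthetic data rather than the scraped salaries:

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)

# Right-skewed toy "salaries" (lognormal) next to a symmetric sample
right_skewed = rng.lognormal(mean=10, sigma=0.5, size=500)
symmetric = rng.normal(loc=30000, scale=5000, size=500)

g_right = skew(right_skewed)  # positive => long right tail
g_sym = skew(symmetric)       # close to zero for a symmetric sample
```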
To answer this question, a D’Agostino-Pearson test of normal distribution was carried out using the library Scipy.
$$\mathbf{H_0}: Normal \; Distribution$$
$$\mathbf{H_1}: Not \; H_0$$
$$\alpha = 0.05$$
# D’Agostino-Pearson test of normal distribution
stat, pvalue = stats.normaltest(salary_df['Avg Salary'])
print(f'The K2 statistic is: {stat:.03f}. \nThe p-value is: {pvalue:.03f}.')
The K2 statistic is: 24.808.
The p-value is: 0.000.
Indeed, as the p-value is below the significance level ($p < 0.05$), the null hypothesis that the sample comes from a normal distribution is rejected.
To answer this question, multiple histograms were drawn with the salary data for each data job category using Matplotlib and Plotly.
# Multiple histograms with Matplotlib
color = sns.color_palette("Blues_r")[0]
edgecolor = 'white'
alpha = 0.5
figure, axis = plt.subplots(nrows=4, ncols=2, figsize = (13,18))
axis[0, 0].hist(salary_df.loc[salary_df['Job'] == 'BI Analyst']['Avg Salary'], color = color, edgecolor = edgecolor, alpha = alpha)
axis[0, 0].set_title("BI Analyst")
axis[0, 0].xaxis.set_major_formatter(mpl.ticker.StrMethodFormatter('${x:,.0f}'))
axis[0, 1].hist(salary_df.loc[salary_df['Job'] == 'Business Analyst']['Avg Salary'], color = color, edgecolor = edgecolor, alpha = alpha)
axis[0, 1].set_title("Business Analyst")
axis[0, 1].xaxis.set_major_formatter(mpl.ticker.StrMethodFormatter('${x:,.0f}'))
axis[1, 0].hist(salary_df.loc[salary_df['Job'] == 'Data Analyst']['Avg Salary'], color = color, edgecolor = edgecolor, alpha = alpha)
axis[1, 0].set_title("Data Analyst")
axis[1, 1].hist(salary_df.loc[salary_df['Job'] == 'Data Architect']['Avg Salary'], color = color, edgecolor = edgecolor, alpha = alpha)
axis[1, 1].set_title("Data Architect")
axis[2, 0].hist(salary_df.loc[salary_df['Job'] == 'Data Engineer']['Avg Salary'], color = color, edgecolor = edgecolor, alpha = alpha)
axis[2, 0].set_title("Data Engineer")
axis[2, 1].hist(salary_df.loc[salary_df['Job'] == 'Data Scientist']['Avg Salary'], color = color, edgecolor = edgecolor, alpha = alpha)
axis[2, 1].set_title("Data Scientist")
axis[3, 0].hist(salary_df.loc[salary_df['Job'] == 'ML Engineer']['Avg Salary'], color = color, edgecolor = edgecolor, alpha = alpha)
axis[3, 0].set_title("ML Engineer")
axis[3, 0].xaxis.set_major_formatter(mpl.ticker.StrMethodFormatter('${x:,.0f}'))
plt.savefig('Figures/Fig18_SalaryDistributionsPerDataJobCategory.png', bbox_inches = 'tight')
plt.show()
# Multiple histograms with Plotly
bins = 10
fig = make_subplots(
rows=4, cols=2,
subplot_titles=("<b>BI Analyst</b>", "<b>Data Analyst</b>", "<b>Business Analyst</b>", "<b>Data Architect</b>", "<b>Data Engineer</b>", "<b>Data Scientist</b>", "<b>ML Engineer</b>"),
specs=[[{"colspan": 2}, None],
[{}, {}],
[{}, {}],
[{}, {}]]
)
fig.add_trace(go.Histogram(x=salary_df.loc[salary_df['Job'] == 'BI Analyst']['Avg Salary'], nbinsx= bins),
row=1, col=1)
fig.add_trace(go.Histogram(x=salary_df.loc[salary_df['Job'] == 'Data Analyst']['Avg Salary'], nbinsx= bins),
row=2, col=1)
fig.add_trace(go.Histogram(x=salary_df.loc[salary_df['Job'] == 'Business Analyst']['Avg Salary'], nbinsx= bins),
row=2, col=2)
fig.add_trace(go.Histogram(x=salary_df.loc[salary_df['Job'] == 'Data Architect']['Avg Salary'], nbinsx= bins),
row=3, col=1)
fig.add_trace(go.Histogram(x=salary_df.loc[salary_df['Job'] == 'Data Engineer']['Avg Salary'], nbinsx= bins),
row=3, col=2)
fig.add_trace(go.Histogram(x=salary_df.loc[salary_df['Job'] == 'Data Scientist']['Avg Salary'], nbinsx= bins),
row=4, col=1)
fig.add_trace(go.Histogram(x=salary_df.loc[salary_df['Job'] == 'ML Engineer']['Avg Salary'], nbinsx= bins),
row=4, col=2)
fig.update_layout(height=1300,
width=800,
font=font_px,
title_text="<b>Salary Distributions Per Data Job Category</b>",
title_x=0.5,
bargap=0.02
)
fig.update_traces(marker_color=px.colors.sequential.Blues[7],
marker_line_color='white',
marker_line_width=1,
opacity = 0.5,
showlegend=False)
fig.update_xaxes(tickformat = '$,~s')
for r, c in [(1, 1), (2, 1), (2, 2), (3, 1), (3, 2), (4, 1), (4, 2)]:
    fig.update_xaxes(title_font=dict(size=14), title_text="Average Monthly Salary (MXN)", row=r, col=c)
    fig.update_yaxes(title_font=dict(size=14), title_text="Frequency", row=r, col=c)
fig.write_image("Figures/Fig18_SalaryDistributionsPerDataJobCategory2.png", scale=2)
fig.write_image("Figures/Fig18_SalaryDistributionsPerDataJobCategory2.svg", scale=2)
fig.show(config=config)
Indeed, only the salary observations for the BI Analyst, Business Analyst, and Data Engineer positions somewhat resemble a normal distribution, whereas the same cannot be said of the observations for the other data job categories. Thus, the normality assumption cannot be held.
Nonetheless, for the purposes of the present study, both parametric (ANOVA, Tukey-Kramer and T-test with unequal variance) and non-parametric (Kruskal-Wallis H, Dunn and Mann-Whitney U) tests were carried out to assess the significance of the obtained results.
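Running the parametric and non-parametric omnibus tests side by side can be sketched on synthetic groups (toy means and sample sizes, not the notebook's data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Three synthetic salary groups with clearly different means (toy values)
g1 = rng.normal(25000, 4000, 40)
g2 = rng.normal(40000, 4000, 40)
g3 = rng.normal(55000, 4000, 40)

f_stat, p_anova = stats.f_oneway(g1, g2, g3)  # parametric one-way ANOVA
h_stat, p_kw = stats.kruskal(g1, g2, g3)      # non-parametric Kruskal-Wallis H
```

When both tests agree, the conclusion is robust to the normality violation observed above.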
To answer this question, an ANOVA and a Kruskal-Wallis H test were performed on the salary data using SciPy and Statsmodels.

$$\mathbf{H_0}: \mu_{BI} = \mu_{BA} = \mu_{DA} = \mu_{DR} = \mu_{DE} = \mu_{DS} = \mu_{ML}$$

$$\mathbf{H_1}: \text{Not } H_0$$

$$\alpha = 0.05$$

First, the salary observations for each data job category were prepared from the dataset using the pivot_table method from Pandas.
# Salary pivot table by data job category
salary_pivot_df = pd.pivot_table(salary_df, index = 'Company', columns = 'Job', values = 'Avg Salary', aggfunc= 'mean')
salary_pivot_df.head()
| Job | BI Analyst | Business Analyst | Data Analyst | Data Architect | Data Engineer | Data Scientist | ML Engineer |
|---|---|---|---|---|---|---|---|
| Company | |||||||
| 5-18 Consulting Group ... | NaN | NaN | 30500.000000 | NaN | NaN | NaN | NaN |
| Addon Technologies Inc. | NaN | NaN | NaN | NaN | 77500.0 | NaN | NaN |
| Adecco | 55000.0 | NaN | 18333.166667 | NaN | 27500.0 | NaN | NaN |
| Ait Vanguardia Tecnoló... | NaN | NaN | NaN | NaN | 43500.0 | NaN | NaN |
| Alia | NaN | NaN | 62500.000000 | NaN | NaN | NaN | NaN |
salary_pivot_df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 120 entries, 5-18 Consulting Group ... to Zegovia Rh Sa De Cv
Data columns (total 7 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   BI Analyst        20 non-null     float64
 1   Business Analyst  23 non-null     float64
 2   Data Analyst      64 non-null     float64
 3   Data Architect    6 non-null      float64
 4   Data Engineer     30 non-null     float64
 5   Data Scientist    8 non-null      float64
 6   ML Engineer       1 non-null      float64
dtypes: float64(7)
memory usage: 7.5+ KB
It is important to bear in mind that only one salary observation was retrieved for an ML Engineer position.
Then, a NumPy array was defined for each data job category and the NaN values were dropped.
# Array of salary observations by data job category
BI = salary_pivot_df['BI Analyst'].dropna().values
BA = salary_pivot_df['Business Analyst'].dropna().values
DA = salary_pivot_df['Data Analyst'].dropna().values
DR = salary_pivot_df['Data Architect'].dropna().values
DE = salary_pivot_df['Data Engineer'].dropna().values
DS = salary_pivot_df['Data Scientist'].dropna().values
ML = salary_pivot_df['ML Engineer'].dropna().values
Then, the parametric ANOVA test was carried out:
# ANOVA with outliers
stat, pvalue = stats.f_oneway(BI, BA, DA, DR, DE, DS, ML)
print(f'The F statistic is: {stat:.03f}. \nThe p-value is: {pvalue:.03f}.\n')
alpha = 0.05
if pvalue > alpha:
    print('Test interpretation: Fail to reject H0.')
else:
    print('Test interpretation: Reject H0.')
The F statistic is: 8.657.
The p-value is: 0.000.

Test interpretation: Reject H0.
# ANOVA using the Statsmodels Formula API
model = smf.ols(formula = "Salary ~ C(Job)", data = salary_df.rename(columns = {'Avg Salary': 'Salary'})).fit()
anova_table = sm.stats.anova_lm(model, typ=2)
anova_table
| sum_sq | df | F | PR(>F) | |
|---|---|---|---|---|
| C(Job) | 1.924338e+10 | 6.0 | 11.149694 | 1.200306e-10 |
| Residual | 5.465386e+10 | 190.0 | NaN | NaN |
Moreover, the non-parametric Kruskal-Wallis H test was also performed:
# Kruskal-Wallis H Test
stat, pvalue = stats.kruskal(BI, BA, DA, DR, DE, DS, ML)
print(f'The H statistic is: {stat:.03f}. \nThe p-value is: {pvalue:.03f}.\n')
alpha = 0.05
if pvalue > alpha:
    print('Test interpretation: Fail to reject H0.')
else:
    print('Test interpretation: Reject H0.')
The H statistic is: 42.662.
The p-value is: 0.000.

Test interpretation: Reject H0.
The obtained p-values from both tests are significant ($p < 0.05$). Therefore, the salary differences among the data job categories are statistically significant.
The next step, then, is to identify which pairwise salary differences among the data job categories are significant.
To answer this question, the Tukey-Kramer and Dunn's post hoc tests were performed on the salary data using Scikit_posthocs.
So, the data was first transformed into a tidy (long) format so that it could be processed by the library.
# Preparing dataframe in a tidy format
salary_tidy_df = pd.melt(salary_pivot_df, var_name='Job', value_name='Salary').dropna().reset_index(drop=True)
salary_tidy_df.head()
| Job | Salary | |
|---|---|---|
| 0 | BI Analyst | 55000.0 |
| 1 | BI Analyst | 27500.0 |
| 2 | BI Analyst | 17800.0 |
| 3 | BI Analyst | 27500.0 |
| 4 | BI Analyst | 17500.0 |
Then, the post hoc tests were carried out.
# Tukey-Kramer Test
tukey_df = posthoc_tukey(salary_tidy_df, val_col='Salary', group_col='Job')
tukey_df
| BI Analyst | Business Analyst | Data Analyst | Data Architect | Data Engineer | Data Scientist | ML Engineer | |
|---|---|---|---|---|---|---|---|
| BI Analyst | 1.000000 | 0.900000 | 0.824031 | 0.030082 | 0.027678 | 0.374142 | 0.392984 |
| Business Analyst | 0.900000 | 1.000000 | 0.505134 | 0.053670 | 0.062074 | 0.524874 | 0.460989 |
| Data Analyst | 0.824031 | 0.505134 | 1.000000 | 0.001000 | 0.001000 | 0.026352 | 0.200257 |
| Data Architect | 0.030082 | 0.053670 | 0.001000 | 1.000000 | 0.871216 | 0.900000 | 0.900000 |
| Data Engineer | 0.027678 | 0.062074 | 0.001000 | 0.871216 | 1.000000 | 0.900000 | 0.900000 |
| Data Scientist | 0.374142 | 0.524874 | 0.026352 | 0.900000 | 0.900000 | 1.000000 | 0.900000 |
| ML Engineer | 0.392984 | 0.460989 | 0.200257 | 0.900000 | 0.900000 | 0.900000 | 1.000000 |
# Dunn's Test
dunn_df = posthoc_dunn(salary_tidy_df, val_col='Salary', group_col='Job')
dunn_df
| BI Analyst | Business Analyst | Data Analyst | Data Architect | Data Engineer | Data Scientist | ML Engineer | |
|---|---|---|---|---|---|---|---|
| BI Analyst | 1.000000 | 0.755463 | 6.152017e-02 | 0.019224 | 1.250973e-02 | 0.130236 | 0.157206 |
| Business Analyst | 0.755463 | 1.000000 | 1.818269e-02 | 0.030046 | 2.396332e-02 | 0.190131 | 0.184929 |
| Data Analyst | 0.061520 | 0.018183 | 1.000000e+00 | 0.000239 | 5.860401e-08 | 0.003024 | 0.055678 |
| Data Architect | 0.019224 | 0.030046 | 2.385867e-04 | 1.000000 | 4.095596e-01 | 0.397706 | 0.739110 |
| Data Engineer | 0.012510 | 0.023963 | 5.860401e-08 | 0.409560 | 1.000000e+00 | 0.825087 | 0.473576 |
| Data Scientist | 0.130236 | 0.190131 | 3.024004e-03 | 0.397706 | 8.250873e-01 | 1.000000 | 0.441439 |
| ML Engineer | 0.157206 | 0.184929 | 5.567757e-02 | 0.739110 | 4.735764e-01 | 0.441439 | 1.000000 |
To easily visualize the results from the post hoc tests, the p-value matrices were converted to boolean values.
alpha = 0.05
tukey_df < alpha
| BI Analyst | Business Analyst | Data Analyst | Data Architect | Data Engineer | Data Scientist | ML Engineer | |
|---|---|---|---|---|---|---|---|
| BI Analyst | False | False | False | True | True | False | False |
| Business Analyst | False | False | False | False | False | False | False |
| Data Analyst | False | False | False | True | True | True | False |
| Data Architect | True | False | True | False | False | False | False |
| Data Engineer | True | False | True | False | False | False | False |
| Data Scientist | False | False | True | False | False | False | False |
| ML Engineer | False | False | False | False | False | False | False |
dunn_df < alpha
| BI Analyst | Business Analyst | Data Analyst | Data Architect | Data Engineer | Data Scientist | ML Engineer | |
|---|---|---|---|---|---|---|---|
| BI Analyst | False | False | False | True | True | False | False |
| Business Analyst | False | False | True | True | True | False | False |
| Data Analyst | False | True | False | True | True | True | False |
| Data Architect | True | True | True | False | False | False | False |
| Data Engineer | True | True | True | False | False | False | False |
| Data Scientist | False | False | True | False | False | False | False |
| ML Engineer | False | False | False | False | False | False | False |
Thus, according to the results above and a significance level of $\alpha = 0.05$:
In this context, specific data job categories were later compared using the one-sample T-test, the two-sample T-test with unequal variance, the Wilcoxon signed-rank test, and the Mann-Whitney U test to further confirm the significance of the above results.
The one-sample T-test and the Wilcoxon signed-rank test were used when only one salary observation was collected for a data job category (i.e., ML Engineer positions), whereas the two-sample T-test with unequal variance and the Mann-Whitney U test were performed whenever the sample size for both data job categories under comparison was larger than 1 (i.e., the rest of the data job categories).
Thus, a function to compare the means of the different job categories was defined as follows:
# Function to compare the mean salaries using a T-test and the Mann-Whitney U test
def compare_means(sample1, sample2, alternative, equal_var = False, alpha = 0.05):
    """
    This function performs both parametric and non-parametric tests to compare means.
    When having two independent samples, the two-sample T-test and the Mann-Whitney U test are performed.
    On the other hand, when having only one sample, the one-sample T-test and the Wilcoxon signed-rank
    test are performed.

    Parameters
    sample1: First sample array or single value.
    sample2: Second sample array or single value.
    alternative: Alternative hypothesis. If 'greater', the mean of the distribution
        underlying the first sample is greater than the mean of the distribution underlying
        the second sample. If 'less', the mean of the distribution underlying the first
        sample is less than the mean of the distribution underlying the second sample.
        Finally, if 'two-sided', the means of the distributions underlying the samples are unequal.
    equal_var: If True, the test performs a standard independent 2 sample test that
        assumes equal population variances. If False (default), perform Welch's t-test,
        which does not assume equal population variance.
    alpha: Significance level. It is 0.05 by default.

    Returns
    stat_t: The calculated t-statistic.
    pvalue_t: The associated p-value for the chosen alternative from the t-test.
    stat_np: The non-parametric statistic corresponding with the first sample (either
        the Mann-Whitney U or the Wilcoxon signed-rank T).
    pvalue_np: The associated p-value for the chosen alternative from the non-parametric test.
    """
    # Parametric and non-parametric tests
    if len(sample1) == 1:
        # One-sample T-Test (sample1 is the single reference value, so the
        # alternative is evaluated for sample2 relative to it)
        stat_t, pvalue_t = stats.ttest_1samp(popmean = sample1, a = sample2, alternative = alternative)
        # Wilcoxon signed-rank test
        stat_np, pvalue_np = stats.wilcoxon((np.array(sample2) - sample1), alternative = alternative, method='auto')
        # Print of results
        print(f'T-test: The t statistic is {stat_t:.03f}; and the p-value is {pvalue_t:.03f}.\n')
        print(f'Wilcoxon signed-rank test: The T statistic is {stat_np:.03f}; and the p-value is {pvalue_np:.03f}.\n')
    elif len(sample2) == 1:
        # One-sample T-Test
        stat_t, pvalue_t = stats.ttest_1samp(popmean = sample2, a = sample1, alternative = alternative)
        # Wilcoxon signed-rank test
        stat_np, pvalue_np = stats.wilcoxon((np.array(sample1) - sample2), alternative = alternative, method='auto')
        # Print of results
        print(f'T-test: The t statistic is {stat_t:.03f}; and the p-value is {pvalue_t:.03f}.\n')
        print(f'Wilcoxon signed-rank test: The T statistic is {stat_np:.03f}; and the p-value is {pvalue_np:.03f}.\n')
    else:
        # Two-sample T-Test
        stat_t, pvalue_t = stats.ttest_ind(a = sample1, b = sample2, equal_var = equal_var, alternative = alternative)
        # Mann-Whitney U test
        stat_np, pvalue_np = stats.mannwhitneyu(x = sample1, y = sample2, alternative = alternative)
        # Print of results
        print(f'T-test: The t statistic is {stat_t:.03f}; and the p-value is {pvalue_t:.03f}.\n')
        print(f'Mann-Whitney U test: The U statistic is {stat_np:.03f}; and the p-value is {pvalue_np:.03f}.\n')
    # Interpretation (using the alpha parameter rather than a hard-coded value)
    if pvalue_t > alpha and pvalue_np > alpha:
        print('Test interpretation: Fail to reject H0.\n')
    elif pvalue_t < alpha and pvalue_np < alpha:
        print('Test interpretation: Reject H0.\n')
    else:
        print('Inconsistent results between the tests.\n')
    return stat_t, pvalue_t, stat_np, pvalue_np
Furthermore, the difference between the mean salaries, or absolute effect size, was calculated for each data job category comparison to quantify the size of the effect, along with its percentage difference, obtained by dividing the absolute effect size by the average of the two mean salaries and multiplying by 100.
Moreover, the confidence intervals for the mean were estimated by means of bootstrapping, as the salary observations do not follow a Gaussian distribution. Nonetheless, for the sake of completeness, Cohen's d was also estimated.
Likewise, as the effect size was estimated several times across the data job salary comparisons, the following helper functions were defined:
# Function for estimating the Absolute Effect Size
def absolute_effect_size(sample1, sample2):
    """
    This function estimates the effect size between the two sample means by calculating
    the absolute difference between them and its percentage difference.

    Parameters
    sample1: First sample array.
    sample2: Second sample array.

    Returns
    abs_effect_size: The absolute effect size of the two sample means.
    percen_dif: The percentage difference of the two sample means.
    """
    mean_sample1 = np.mean(sample1)
    mean_sample2 = np.mean(sample2)
    # Absolute difference between the means
    abs_effect_size = abs(mean_sample1 - mean_sample2)
    # Percentage difference relative to the average of the two means
    percen_dif = (abs_effect_size / ((mean_sample1 + mean_sample2) / 2)) * 100
    return abs_effect_size, percen_dif
# Function for estimating the Standardized Effect Size (Cohen's d)
def cohens_d(sample1, sample2):
    """
    This function estimates the standardized effect size between the two sample means
    by calculating Cohen's d.

    Parameters
    sample1: First sample array.
    sample2: Second sample array.

    Returns
    d: The standardized effect size of the two sample means (Cohen's d).
    """
    # Length of the samples
    n1, n2 = len(sample1), len(sample2)
    # Variances of the samples
    s1, s2 = np.var(sample1, ddof=1), np.var(sample2, ddof=1)
    # Pooled standard deviation
    s = np.sqrt(((n1 - 1) * s1 + (n2 - 1) * s2) / (n1 + n2 - 2))
    # Mean of the samples
    u1, u2 = np.mean(sample1), np.mean(sample2)
    # Cohen's d
    d = (u1 - u2) / s
    return d
# Function for estimating the bootstrap confidence intervals
def boot_conf_int(sample, alpha = 0.05, iterations = 1000):
    """
    This function estimates the bootstrap confidence interval of a sample mean.

    Parameters
    sample: Sample array.
    alpha: Significance level. It is 0.05 by default.
    iterations: Number of iterations. It is 1000 by default.

    Returns
    boot_stats: List with the bootstrap replicates of the mean.
    lower_ci: Lower confidence limit.
    upper_ci: Upper confidence limit.
    """
    n_iterations = iterations
    # Each bootstrap resample draws 50% of the original sample size (with replacement)
    n_size = int(len(sample) * 0.50)
    boot_stats = []  # named to avoid shadowing the imported scipy.stats module
    for i in range(n_iterations):
        new_sample = resample(sample, n_samples=n_size)
        boot_stats.append(np.mean(new_sample))
    # Percentile confidence limits
    lower_ci = np.percentile(boot_stats, (alpha / 2) * 100)
    upper_ci = np.percentile(boot_stats, ((1 - alpha) + (alpha / 2)) * 100)
    return boot_stats, lower_ci, upper_ci
# Function for estimating the Effect Size
def effect_size(sample1, sample2, legend1 = "Sample 1", legend2 = "Sample 2", alpha = 0.05):
    """
    This function estimates the effect size between the two sample means by calculating
    the absolute effect size, Cohen's d, and the bootstrap confidence intervals.

    Parameters
    sample1: First sample array.
    sample2: Second sample array.
    legend1: String with the legend for sample 1.
    legend2: String with the legend for sample 2.
    alpha: Significance level. It is 0.05 by default.

    Returns
    None
    """
    # Absolute Effect Size
    abs_effect_size, percen_dif = absolute_effect_size(sample1, sample2)
    print(f'The mean salary difference and the percentage difference between the two data job categories is: ${abs_effect_size:,.0f} and {percen_dif:,.2f}%, respectively.\n')
    if len(sample1) > 1 and len(sample2) > 1:
        # Standardized Effect Size (Cohen's d)
        d = cohens_d(sample1, sample2)
        # Interpretation of Cohen's d (based on its magnitude, so the sign does not matter)
        print(f"The Cohen's d between the two data job categories is: {d:,.2f}.")
        if abs(d) <= 0.20:
            print("(Small Effect Size)\n")
        elif abs(d) <= 0.50:
            print("(Medium Effect Size)\n")
        else:
            print("(Large Effect Size)\n")
        # Legends list
        legends = [legend1, legend2]
        # Adding plot
        fig, ax = plt.subplots(figsize = (9, 5))
        for index, sample in enumerate([sample1, sample2]):
            # Bootstrap confidence intervals
            boot_stats, lower_ci, upper_ci = boot_conf_int(sample, alpha = alpha, iterations = 1000)
            print(f"{legends[index]}'s Salary --> {(1 - alpha)*100}% Confidence Interval: (${lower_ci:,.0f}, ${upper_ci:,.0f})")
            ax.hist(boot_stats, color = sns.color_palette('Blues_r')[3 - index],
                    alpha = 0.65, label = f'{legends[index]}')
        print('\n')
        plt.title(f'Bootstrap Distributions for {legend1} and {legend2} Salaries\n')
        plt.xlabel('Average Monthly Salary (MXN)')
        plt.ylabel('Frequency')
        ax.xaxis.set_major_formatter(mpl.ticker.StrMethodFormatter('${x:,.0f}'))
        plt.legend(loc = 'upper right', facecolor = 'white')
        plt.savefig(f'Figures/Fig_BootstrapDistributions_{legend1}-{legend2}.png', bbox_inches = 'tight')
        plt.show()
To answer this question, the one-sample T-test and the Wilcoxon signed-rank test were performed on the salary observations for this pair of data job categories. These tests were selected instead of the two-sample T-test with unequal variance and the Mann-Whitney U test because only one salary observation was collected for ML Engineer positions. Thus, that single salary value served as the population mean, for the sake of comparison.
Furthermore, an effect size analysis was carried out to estimate the absolute salary difference and the percentage difference between the two data job categories.

$$\mathbf{H_0}: \mu_{DR} \geq \mu_{ML}$$

$$\mathbf{H_1}: \mu_{DR} < \mu_{ML}$$

$$\alpha = 0.05$$

Then, the one-sample T-test and the Wilcoxon signed-rank test were carried out to compare the mean salaries of the two data job categories:
# Hypothesis testing
stat_t, pvalue_t, stat_T, pvalue_T = compare_means(DR, ML, alternative = 'less', alpha = 0.05)
T-test: The t statistic is -1.360; and the p-value is 0.116.

Wilcoxon signed-rank test: The T statistic is 6.000; and the p-value is 0.219.

Test interpretation: Fail to reject H0.
The obtained p-values from both tests are not significant ($p > 0.05$). Therefore, there is no evidence that the Data Architect salaries are significantly lower than those for ML Engineers.
However, as only one salary observation was retrieved for ML Engineer positions, only the absolute effect size and the percentage difference were calculated for this case:
# Estimating the Effect Size
effect_size(DR, ML)
The mean salary difference and the percentage difference between the two data job categories is: $10,375 and 17.35%, respectively.
So, from the present analysis, it is possible to conclude that the mean salary difference between ML Engineer and Data Architect positions is neither statistically nor practically significant.
Certainly, a difference of $10,375 MXN per month, or a percentage difference of 17.35%, is not that important when the salary observation for an ML Engineer is $65,000 MXN and the mean salary for **Data Architect** positions is about $54,625 MXN.
Indeed, according to the Reporte del Mercado Laboral de TI México 2023 (Spanish for Mexico IT Labor Market Report 2023) by Olvera (2023), only about 20% of the surveyed IT professionals in Mexico would be willing to change jobs for a salary increase of 20% or less, whereas 79% would be willing to change jobs for a salary increase of 20% or more. So, as a heuristic rule, a percentage difference of 20% was selected as a cut-off value for practical significance.
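As a small illustrative sketch (the helper name and its 20% default are assumptions based on the heuristic above, not part of the original analysis), the cut-off can be applied directly to a pair of mean salaries:

```python
def is_practically_significant(mean1, mean2, cutoff=20.0):
    """Return True if the percentage difference between two mean
    salaries meets or exceeds the cut-off (in percent)."""
    percent_dif = abs(mean1 - mean2) / ((mean1 + mean2) / 2) * 100
    return percent_dif >= cutoff

# ML Engineer observation vs. mean Data Architect salary (values from above)
print(is_practically_significant(65000, 54625))  # -> False (17.35% < 20%)
```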
To answer this question, the one-sample T-test and the Wilcoxon signed-rank test were performed on the salary observations for this pair of data job categories. Furthermore, an effect size analysis was carried out to estimate the absolute salary difference and the percentage difference between the two data job categories.

$$\mathbf{H_0}: \mu_{DE} \geq \mu_{ML}$$

$$\mathbf{H_1}: \mu_{DE} < \mu_{ML}$$

$$\alpha = 0.05$$

# Hypothesis testing
stat_t, pvalue_t, stat_T, pvalue_T = compare_means(DE, ML, alternative = 'less', alpha = 0.05)
T-test: The t statistic is -5.620; and the p-value is 0.000.

Wilcoxon signed-rank test: The T statistic is 18.500; and the p-value is 0.000.

Test interpretation: Reject H0.
d:\DataJobsMX-Nov2023\.venv\lib\site-packages\scipy\stats\_morestats.py:3414: UserWarning: Exact p-value calculation does not work if there are zeros. Switching to normal approximation.
The obtained p-values from both tests are significant ($p < 0.05$). Therefore, the Data Engineer salaries are significantly lower than those for ML Engineers.
# Estimating the Effect Size
effect_size(DE, ML)
The mean salary difference and the percentage difference between the two data job categories is: $19,612 and 35.53%, respectively.
So, from the effect size analysis, it is possible to conclude that the mean salary difference between ML Engineer and Data Engineer positions is actually practically significant. Indeed, a difference of $19,612 MXN per month or a percentage difference of 35.53% is important in the Mexican labor market.
Thus, even though the post hoc tests above did not flag this salary difference as statistically significant, it is practically significant. That is why an effect size analysis should be carried out along with the hypothesis testing.
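The converse can also happen: with a large enough sample, a negligible difference becomes statistically significant. A minimal sketch with synthetic data (not from the OCC dataset) illustrates why both analyses are needed:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Two synthetic salary samples whose population means differ by only 1%
a = rng.normal(40000, 8000, size=20000)
b = rng.normal(40400, 8000, size=20000)

t, p = stats.ttest_ind(a, b, equal_var=False)
percent_dif = abs(a.mean() - b.mean()) / ((a.mean() + b.mean()) / 2) * 100
print(f"p-value: {p:.2e} ({'significant' if p < 0.05 else 'not significant'} at alpha = 0.05)")
print(f"percentage difference: {percent_dif:.2f}% (well below the 20% cut-off)")
```

Here the huge sample size makes the tiny difference detectable, yet its percentage difference remains far below the practical-significance cut-off adopted above.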
To answer this question, the one-sample T-test and the Wilcoxon signed-rank test were performed on the salary observations for this pair of data job categories. Furthermore, an effect size analysis was carried out to estimate the absolute salary difference and the percentage difference between the two data job categories.

$$\mathbf{H_0}: \mu_{DS} \geq \mu_{ML}$$

$$\mathbf{H_1}: \mu_{DS} < \mu_{ML}$$

$$\alpha = 0.05$$

# Hypothesis testing
stat_t, pvalue_t, stat_T, pvalue_T = compare_means(ML, DS, alternative = 'less', alpha = 0.05)
T-test: The t statistic is -2.311; and the p-value is 0.027.

Wilcoxon signed-rank test: The T statistic is 3.000; and the p-value is 0.020.

Test interpretation: Reject H0.
The obtained p-values from both tests are significant ($p < 0.05$). Therefore, the Data Scientist salaries are significantly lower than those for ML Engineers.
# Estimating the Effect Size
effect_size(DS, ML)
The mean salary difference and the percentage difference between the two data job categories is: $20,531 and 37.51%, respectively.
So, from the effect size analysis, it is possible to conclude that the mean salary difference between ML Engineer and Data Scientist positions is actually practically significant. Indeed, a difference of $20,531 MXN per month or a percentage difference of 37.51% is important in the Mexican labor market.
To answer this question, the one-sample T-test and the Wilcoxon signed-rank test were performed on the salary observations for this pair of data job categories. Furthermore, an effect size analysis was carried out to estimate the absolute salary difference and the percentage difference between the two data job categories.

$$\mathbf{H_0}: \mu_{BA} \geq \mu_{ML}$$

$$\mathbf{H_1}: \mu_{BA} < \mu_{ML}$$

$$\alpha = 0.05$$

# Hypothesis testing
stat_t, pvalue_t, stat_T, pvalue_T = compare_means(ML, BA, alternative = 'less', alpha = 0.05)
T-test: The t statistic is -10.259; and the p-value is 0.000.

Wilcoxon signed-rank test: The T statistic is 1.500; and the p-value is 0.000.

Test interpretation: Reject H0.
The obtained p-values from both tests are significant ($p < 0.05$). Therefore, the Business Analyst salaries are significantly lower than those for ML Engineers.
# Estimating the Effect Size
effect_size(BA, ML)
The mean salary difference and the percentage difference between the two data job categories is: $33,107 and 68.34%, respectively.
So, from the effect size analysis, it is possible to conclude that the mean salary difference between ML Engineer and Business Analyst positions is actually practically significant. Indeed, a difference of $33,107 MXN per month or a percentage difference of 68.34% is remarkable in the Mexican labor market.
To answer this question, the one-sample T-test and the Wilcoxon signed-rank test were performed on the salary observations for this pair of data job categories. Furthermore, an effect size analysis was carried out to estimate the absolute salary difference and the percentage difference between the two data job categories.

$$\mathbf{H_0}: \mu_{BI} \geq \mu_{ML}$$

$$\mathbf{H_1}: \mu_{BI} < \mu_{ML}$$

$$\alpha = 0.05$$

# Hypothesis testing
stat_t, pvalue_t, stat_T, pvalue_T = compare_means(ML, BI, alternative = 'less', alpha = 0.05)
T-test: The t statistic is -11.271; and the p-value is 0.000.

Wilcoxon signed-rank test: The T statistic is 1.000; and the p-value is 0.000.

Test interpretation: Reject H0.
The obtained p-values from both tests are significant ($p < 0.05$). Therefore, the BI Analyst salaries are significantly lower than those for ML Engineers.
# Estimating the Effect Size
effect_size(ML, BI)
The mean salary difference and the percentage difference between the two data job categories is: $35,053 and 73.84%, respectively.
Thus, from the effect size analysis, it is possible to conclude that the mean salary difference between ML Engineer and BI Analyst positions is actually practically significant. Indeed, a difference of $35,053 MXN per month or a percentage difference of 73.84% is remarkable in the Mexican labor market.
To answer this question, the one-sample T-test and the Wilcoxon signed-rank test were performed on the salary observations for this pair of data job categories. Furthermore, an effect size analysis was carried out to estimate the absolute salary difference and the percentage difference between the two data job categories.

$$\mathbf{H_0}: \mu_{DA} \geq \mu_{ML}$$

$$\mathbf{H_1}: \mu_{DA} < \mu_{ML}$$

$$\alpha = 0.05$$

# Hypothesis testing
stat_t, pvalue_t, stat_T, pvalue_T = compare_means(DA, ML, alternative = 'less', alpha = 0.05)
T-test: The t statistic is -21.190; and the p-value is 0.000.

Wilcoxon signed-rank test: The T statistic is 4.000; and the p-value is 0.000.

Test interpretation: Reject H0.
The obtained p-values from both tests are significant ($p < 0.05$). Therefore, the Data Analyst salaries are significantly lower than those for ML Engineers.
# Estimating the Effect Size
effect_size(DA, ML)
The mean salary difference and the percentage difference between the two data job categories is: $40,693 and 91.13%, respectively.
Thus, from the effect size analysis, it is possible to conclude that the mean salary difference between ML Engineer and Data Analyst positions is actually practically significant. Indeed, a difference of $40,693 MXN per month or a percentage difference of 91.13% is very important in the Mexican labor market.
To answer this question, the two-sample T-test with unequal variance and the Mann-Whitney U test were performed on the salary observations. Furthermore, an effect size analysis was carried out to estimate the absolute salary difference and the percentage difference between the two data job categories, the corresponding Cohen's d, as well as their bootstrap confidence intervals.

$$\mathbf{H_0}: \mu_{DR} \leq \mu_{DE}$$

$$\mathbf{H_1}: \mu_{DR} > \mu_{DE}$$

$$\alpha = 0.05$$

# Hypothesis testing
stat_t, pvalue_t, stat_u, pvalue_u = compare_means(DR, DE, alternative = 'greater', alpha = 0.05)
T-test: The t statistic is 1.101; and the p-value is 0.153.

Mann-Whitney U test: The U statistic is 119.000; and the p-value is 0.113.

Test interpretation: Fail to reject H0.
The obtained p-values from both tests are not significant ($p > 0.05$). Therefore, the Data Architect salaries are not significantly higher than those for Data Engineers.
# Estimating the Effect Size
effect_size(DR, DE, "Data Architect", "Data Engineer", alpha = 0.05)
The mean salary difference and the percentage difference between the two data job categories is: $9,238 and 18.47%, respectively.

The Cohen's d between the two data job categories is: 0.48.
(Medium Effect Size)

Data Architect's Salary --> 95.0% Confidence Interval: ($33,750, $72,083)
Data Engineer's Salary --> 95.0% Confidence Interval: ($36,472, $54,384)
So, from the effect size analysis, it is possible to conclude that the mean salary difference between Data Architect and Data Engineer positions is neither statistically nor practically significant.
Certainly, a difference of $9,238 MXN per month and a percentage difference of about 18.47% are not that large in the Mexican labor market, falling below the 20% cut-off adopted for practical significance. And, in this sense, it is noteworthy that the bootstrap confidence intervals for both positions are similar.
To answer this question, the two-sample T-test with unequal variance and the Mann-Whitney U test were performed on the salary observations. Furthermore, an effect size analysis was carried out to estimate the absolute salary difference and the percentage difference between the two data job categories, the corresponding Cohen's d, as well as their bootstrap confidence intervals.

$$\mathbf{H_0}: \mu_{DR} \leq \mu_{DS}$$

$$\mathbf{H_1}: \mu_{DR} > \mu_{DS}$$

$$\alpha = 0.05$$

# Hypothesis testing
stat_t, pvalue_t, stat_u, pvalue_u = compare_means(DR, DS, alternative = 'greater', alpha = 0.05)
T-test: The t statistic is 0.867; and the p-value is 0.201.

Mann-Whitney U test: The U statistic is 31.500; and the p-value is 0.183.

Test interpretation: Fail to reject H0.
The obtained p-values from both tests are not significant ($p > 0.05$). Therefore, the Data Architect salaries are not significantly higher than those for Data Scientists.
# Estimating the Effect Size
effect_size(DR, DS, "Data Architect", "Data Scientist", alpha = 0.05)
The mean salary difference and the percentage difference between the two data job categories is: $10,156 and 20.50%, respectively.

The Cohen's d between the two data job categories is: 0.45.
(Medium Effect Size)

Data Architect's Salary --> 95.0% Confidence Interval: ($36,833, $72,083)
Data Scientist's Salary --> 95.0% Confidence Interval: ($23,875, $71,875)
So, from the effect size analysis, it is possible to conclude that the mean salary difference between Data Architect and Data Scientist positions is neither statistically nor practically significant.
Certainly, a difference of $10,156 MXN per month and a percentage difference of about 20.50% is not that large in the Mexican labor market for these kinds of positions, taking into account that the mean salaries for Data Architect and Data Scientist positions are $54,625 MXN and $44,468 MXN, respectively. And, in this sense, it is noteworthy that the bootstrap confidence intervals for both positions are similar.
To answer this question, the two-sample T-test with unequal variance and the Mann-Whitney U test were performed on the salary observations. Furthermore, an effect size analysis was carried out to estimate the absolute salary difference and the percentage difference between the two data job categories, the corresponding Cohen's d, as well as their bootstrap confidence intervals.

$$\mathbf{H_0}: \mu_{DR} \leq \mu_{BA}$$

$$\mathbf{H_1}: \mu_{DR} > \mu_{BA}$$

$$\alpha = 0.05$$

# Hypothesis testing
stat_t, pvalue_t, stat_u, pvalue_u = compare_means(DR, BA, alternative = 'greater', alpha = 0.05)
T-test: The t statistic is 2.744; and the p-value is 0.015.

Mann-Whitney U test: The U statistic is 115.500; and the p-value is 0.007.

Test interpretation: Reject H0.
The obtained p-values from both tests are significant ($p < 0.05$). Therefore, the Data Architect salaries are significantly higher than those for Business Analysts.
# Estimating the Effect Size
effect_size(DR, BA, "Data Architect", "Business Analyst", alpha = 0.05)
The mean salary difference and the percentage difference between the two data job categories is: $22,732 and 52.55%, respectively.

The Cohen's d between the two data job categories is: 1.41.
(Large Effect Size)

Data Architect's Salary --> 95.0% Confidence Interval: ($30,917, $70,000)
Business Analyst's Salary --> 95.0% Confidence Interval: ($22,776, $41,412)
So, from the effect size analysis, it is possible to conclude that the mean salary difference between Data Architect and Business Analyst positions is both statistically and practically significant.
Indeed, a difference of $22,732 MXN per month and a percentage difference of about 52.55% is important in the Mexican labor market. And, in this sense, it is noteworthy that the bootstrap confidence intervals for each position are very different.
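The `effect_size` helper used above is likewise defined earlier in the notebook. The following is a minimal sketch under stated assumptions (pooled-SD Cohen's d and percentile bootstrap intervals for each group's mean; the helper's actual implementation and printed format may differ):

```python
import numpy as np
from sklearn.utils import resample

def cohens_d(x, y):
    """Cohen's d using the pooled standard deviation."""
    nx, ny = len(x), len(y)
    pooled_sd = np.sqrt(((nx - 1) * np.var(x, ddof=1) +
                         (ny - 1) * np.var(y, ddof=1)) / (nx + ny - 2))
    return (np.mean(x) - np.mean(y)) / pooled_sd

def effect_size(x, y, name_x, name_y, alpha=0.05, n_boot=5000, seed=0):
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    diff = np.mean(x) - np.mean(y)          # absolute mean salary difference
    pct = 100 * diff / np.mean(y)           # percentage difference vs. the lower group
    d = cohens_d(x, y)
    # Percentile bootstrap confidence interval for each group's mean salary
    rng = np.random.RandomState(seed)
    boot_x = [np.mean(resample(x, random_state=rng)) for _ in range(n_boot)]
    boot_y = [np.mean(resample(y, random_state=rng)) for _ in range(n_boot)]
    lo, hi = 100 * alpha / 2, 100 * (1 - alpha / 2)
    ci_x, ci_y = np.percentile(boot_x, [lo, hi]), np.percentile(boot_y, [lo, hi])
    print(f"Mean salary difference: ${diff:,.0f} ({pct:.2f}%); Cohen's d: {d:.2f}")
    print(f"{name_x}'s Salary --> {100*(1-alpha):.1f}% CI: (${ci_x[0]:,.0f}, ${ci_x[1]:,.0f})")
    print(f"{name_y}'s Salary --> {100*(1-alpha):.1f}% CI: (${ci_y[0]:,.0f}, ${ci_y[1]:,.0f})")
    return diff, pct, d, ci_x, ci_y
```
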
To answer this question, the two-sample t-test with unequal variances (Welch's t-test) and the Mann-Whitney U test were performed on the salary observations. Furthermore, an effect size analysis was carried out to estimate the absolute salary difference and the percentage difference between the two data job categories, the corresponding Cohen's d, as well as their bootstrap confidence intervals.
$$\mathbf{H_0}: \mu_{DR} \leq \mu_{BI}$$
$$\mathbf{H_1}: \mu_{DR} > \mu_{BI}$$
$$\alpha = 0.05$$
# Hypothesis testing
stat_t, pvalue_t, stat_u, pvalue_u = compare_means(DR, BI, alternative = 'greater', alpha = 0.05)
T-test: The t statistic is 2.995; and the p-value is 0.010. Mann-Whitney U test: The U statistic is 101.000; and the p-value is 0.007. Test interpretation: Reject H0.
The obtained p-values from both tests are significant ($p < 0.05$). Therefore, the Data Architect salaries are significantly higher than those for BI Analysts.
# Estimating the Effect Size
effect_size(DR, BI, "Data Architect", "BI Analyst", alpha = 0.05)
The mean salary difference and the percentage difference between the two data job categories is: $24,678 and 58.36%, respectively. The Cohen's d between the two data job categories is: 1.64. (Large Effect Size) Data Architect's Salary --> 95.0% Confidence Interval: ($33,750, $72,500) BI Analyst's Salary --> 95.0% Confidence Interval: ($22,749, $39,650)
So, from the effect size analysis, it is possible to conclude that the mean salary difference between Data Architect and BI Analyst positions is both statistically and practically significant.
Indeed, a difference of $24,678 MXN per month and a percentage difference of about 58.36% is important in the Mexican labor market. And, in this sense, it is noteworthy that the bootstrap confidence intervals for each position are very different.
To answer this question, the two-sample t-test with unequal variances (Welch's t-test) and the Mann-Whitney U test were performed on the salary observations. Furthermore, an effect size analysis was carried out to estimate the absolute salary difference and the percentage difference between the two data job categories, the corresponding Cohen's d, as well as their bootstrap confidence intervals.
$$\mathbf{H_0}: \mu_{DR} \leq \mu_{DA}$$
$$\mathbf{H_1}: \mu_{DR} > \mu_{DA}$$
$$\alpha = 0.05$$
# Hypothesis testing
stat_t, pvalue_t, stat_u, pvalue_u = compare_means(DR, DA, alternative = 'greater', alpha = 0.05)
T-test: The t statistic is 3.854; and the p-value is 0.005. Mann-Whitney U test: The U statistic is 345.000; and the p-value is 0.001. Test interpretation: Reject H0.
The obtained p-values from both tests are significant ($p < 0.05$). Therefore, the Data Architect salaries are significantly higher than those for Data Analysts.
# Estimating the Effect Size
effect_size(DR, DA, "Data Architect", "Data Analyst", alpha = 0.05)
The mean salary difference and the percentage difference between the two data job categories is: $30,318 and 76.82%, respectively. The Cohen's d between the two data job categories is: 1.94. (Large Effect Size) Data Architect's Salary --> 95.0% Confidence Interval: ($33,750, $72,083) Data Analyst's Salary --> 95.0% Confidence Interval: ($19,596, $29,860)
So, from the effect size analysis, it is possible to conclude that the mean salary difference between Data Architect and Data Analyst positions is both statistically and practically significant.
Indeed, a difference of $30,318 MXN per month and a percentage difference of about 76.82% is important in the Mexican labor market. In this sense, it is noteworthy that the bootstrap confidence intervals for the two positions also differ markedly from each other.
To answer this question, the two-sample t-test with unequal variances (Welch's t-test) and the Mann-Whitney U test were performed on the salary observations. Furthermore, an effect size analysis was carried out to estimate the absolute salary difference and the percentage difference between the two data job categories, the corresponding Cohen's d, as well as their bootstrap confidence intervals.
$$\mathbf{H_0}: \mu_{DE} \leq \mu_{DS}$$
$$\mathbf{H_1}: \mu_{DE} > \mu_{DS}$$
$$\alpha = 0.05$$
# Hypothesis testing
stat_t, pvalue_t, stat_u, pvalue_u = compare_means(DE, DS, alternative = 'greater', alpha = 0.05)
T-test: The t statistic is 0.096; and the p-value is 0.463. Mann-Whitney U test: The U statistic is 127.500; and the p-value is 0.401. Test interpretation: Fail to reject H0.
The obtained p-values from the T-test and the Mann-Whitney U test were not significant ($p > 0.05$). Therefore, the Data Engineer salaries are not significantly higher than those for Data Scientists.
# Effect size analysis
effect_size(DE, DS, "Data Engineer", "Data Scientist", alpha = 0.05)
The mean salary difference and the percentage difference between the two data job categories is: $919 and 2.04%, respectively. The Cohen's d between the two data job categories is: 0.04. (Small Effect Size) Data Engineer's Salary --> 95.0% Confidence Interval: ($35,539, $54,317) Data Scientist's Salary --> 95.0% Confidence Interval: ($24,562, $71,897)
From the effect size analysis, it is possible to conclude that the mean salary difference between Data Engineer and Data Scientist positions is not practically significant either.
Certainly, a difference of $919 MXN per month and a percentage difference of about 2.04% is negligible in the Mexican labor market for technology. In this sense, it is noteworthy that the bootstrap confidence intervals for both positions overlap to a large extent, as clearly shown in the plot above.
To answer this question, the two-sample t-test with unequal variances (Welch's t-test) and the Mann-Whitney U test were performed on the salary observations. Furthermore, an effect size analysis was carried out to estimate the absolute salary difference and the percentage difference between the two data job categories, the corresponding Cohen's d, as well as their bootstrap confidence intervals.
$$\mathbf{H_0}: \mu_{DE} \leq \mu_{BA}$$
$$\mathbf{H_1}: \mu_{DE} > \mu_{BA}$$
$$\alpha = 0.05$$
# Hypothesis testing
stat_t, pvalue_t, stat_u, pvalue_u = compare_means(DE, BA, alternative = 'greater', alpha = 0.05)
T-test: The t statistic is 2.839; and the p-value is 0.003. Mann-Whitney U test: The U statistic is 487.500; and the p-value is 0.005. Test interpretation: Reject H0.
The obtained p-values from both tests are significant ($p < 0.05$). Therefore, the Data Engineer salaries are significantly higher than those for Business Analysts.
# Effect size analysis
effect_size(DE, BA, "Data Engineer", "Business Analyst", alpha = 0.05)
The mean salary difference and the percentage difference between the two data job categories is: $13,494 and 34.92%, respectively. The Cohen's d between the two data job categories is: 0.77. (Large Effect Size) Data Engineer's Salary --> 95.0% Confidence Interval: ($36,665, $55,675) Business Analyst's Salary --> 95.0% Confidence Interval: ($23,784, $41,184)
From the effect size analysis, it is possible to conclude that the mean salary difference between Data Engineer and Business Analyst positions is practically significant. Indeed, a difference of about $13,494 MXN per month and a percentage difference of about 34.92% is relevant in the Mexican labor market.
To answer this question, the two-sample t-test with unequal variances (Welch's t-test) and the Mann-Whitney U test were performed on the salary observations. Furthermore, an effect size analysis was carried out to estimate the absolute salary difference and the percentage difference between the two data job categories, the corresponding Cohen's d, as well as their bootstrap confidence intervals.
$$\mathbf{H_0}: \mu_{DE} \leq \mu_{BI}$$
$$\mathbf{H_1}: \mu_{DE} > \mu_{BI}$$
$$\alpha = 0.05$$
# Hypothesis testing
stat_t, pvalue_t, stat_u, pvalue_u = compare_means(DE, BI, alternative = 'greater', alpha = 0.05)
T-test: The t statistic is 3.303; and the p-value is 0.001. Mann-Whitney U test: The U statistic is 453.500; and the p-value is 0.001. Test interpretation: Reject H0.
The obtained p-values from both tests are significant ($p < 0.05$). Therefore, the Data Engineer salaries are significantly higher than those for BI Analysts.
# Effect size analysis
effect_size(DE, BI, "Data Engineer", "BI Analyst", alpha = 0.05)
The mean salary difference and the percentage difference between the two data job categories is: $15,441 and 40.99%, respectively. The Cohen's d between the two data job categories is: 0.90. (Large Effect Size) Data Engineer's Salary --> 95.0% Confidence Interval: ($36,350, $55,168) BI Analyst's Salary --> 95.0% Confidence Interval: ($22,488, $39,125)
From the effect size analysis, it is possible to conclude that the mean salary difference between Data Engineer and BI Analyst positions is not only statistically significant but also practically significant.
Certainly, a difference of $15,441 MXN per month and a percentage difference of about 40.99% is important in the Mexican labor market. And, in this sense, it is noteworthy that the bootstrap confidence intervals for both positions are different from each other as shown in the plot above.
To answer this question, the two-sample t-test with unequal variances (Welch's t-test) and the Mann-Whitney U test were performed on the salary observations. Furthermore, an effect size analysis was carried out to estimate the absolute salary difference and the percentage difference between the two data job categories, the corresponding Cohen's d, as well as their bootstrap confidence intervals.
$$\mathbf{H_0}: \mu_{DE} \leq \mu_{DA}$$
$$\mathbf{H_1}: \mu_{DE} > \mu_{DA}$$
$$\alpha = 0.05$$
# Hypothesis testing
stat_t, pvalue_t, stat_u, pvalue_u = compare_means(DE, DA, alternative = 'greater', alpha = 0.05)
T-test: The t statistic is 5.293; and the p-value is 0.000. Mann-Whitney U test: The U statistic is 1593.500; and the p-value is 0.000. Test interpretation: Reject H0.
The obtained p-values from both tests are significant ($p < 0.05$). Therefore, the Data Engineer salaries are significantly higher than those for Data Analysts.
# Effect size analysis
effect_size(DE, DA, "Data Engineer", "Data Analyst", alpha = 0.05)
The mean salary difference and the percentage difference between the two data job categories is: $21,081 and 60.50%, respectively. The Cohen's d between the two data job categories is: 1.27. (Large Effect Size) Data Engineer's Salary --> 95.0% Confidence Interval: ($35,898, $54,400) Data Analyst's Salary --> 95.0% Confidence Interval: ($19,518, $30,084)
From the effect size analysis, it is possible to conclude that the mean salary difference between Data Engineer and Data Analyst positions is not only statistically significant but also practically significant.
Certainly, a difference of $21,081 MXN per month and a percentage difference of about 60.50% is large in the Mexican labor market.
To answer this question, the two-sample t-test with unequal variances (Welch's t-test) and the Mann-Whitney U test were performed on the salary observations. Furthermore, an effect size analysis was carried out to estimate the absolute salary difference and the percentage difference between the two data job categories, the corresponding Cohen's d, as well as their bootstrap confidence intervals.
$$\mathbf{H_0}: \mu_{DS} \leq \mu_{BA}$$
$$\mathbf{H_1}: \mu_{DS} > \mu_{BA}$$
$$\alpha = 0.05$$
# Hypothesis testing
stat_t, pvalue_t, stat_u, pvalue_u = compare_means(DS, BA, alternative = 'greater', alpha = 0.05)
T-test: The t statistic is 1.331; and the p-value is 0.108. Mann-Whitney U test: The U statistic is 122.500; and the p-value is 0.088. Test interpretation: Fail to reject H0.
The obtained p-values from the T-test and the Mann-Whitney U test were not significant ($p > 0.05$). Therefore, the Data Scientist salaries are not significantly higher than those for Business Analysts.
# Effect size analysis
effect_size(DS, BA, "Data Scientist", "Business Analyst", alpha = 0.05)
The mean salary difference and the percentage difference between the two data job categories is: $12,575 and 32.94%, respectively. The Cohen's d between the two data job categories is: 0.69. (Large Effect Size) Data Scientist's Salary --> 95.0% Confidence Interval: ($24,812, $70,894) Business Analyst's Salary --> 95.0% Confidence Interval: ($22,657, $41,298)
In this case, even though the mean salary difference between Data Scientist and Business Analyst positions is not statistically significant, a difference of $12,575 MXN per month and a percentage difference of about 32.94% is not negligible.
Therefore, notwithstanding the lack of statistical significance, based on the effect size analysis it is possible to state that there is a practically significant difference between the mean salaries of the Data Scientist and Business Analyst positions. That is why an effect size analysis should be carried out alongside hypothesis testing.
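This situation can be illustrated with a small hypothetical example (the salary figures below are invented purely for illustration): with only a handful of observations per group, a gap of practical importance can fail to reach statistical significance even though Cohen's d is large.

```python
import numpy as np
from scipy import stats

# Hypothetical monthly salaries (MXN), 5 postings per group
high = np.array([30000, 40000, 50000, 60000, 70000], dtype=float)
low = np.array([25000, 32000, 37000, 42000, 49000], dtype=float)

# Welch's one-sided t-test: small n keeps the p-value above 0.05
t, p = stats.ttest_ind(high, low, equal_var=False, alternative='greater')

# Cohen's d with pooled standard deviation: a large effect nonetheless
pooled_sd = np.sqrt((np.var(high, ddof=1) + np.var(low, ddof=1)) / 2)
d = (high.mean() - low.mean()) / pooled_sd

print(f"p-value: {p:.3f}, Cohen's d: {d:.2f}")  # large d, yet p > 0.05
```

Here the mean gap of $13,000 MXN per month is practically meaningful, but the test lacks the power to detect it, which is exactly why the effect size is reported alongside the p-value throughout this section.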
To answer this question, the two-sample t-test with unequal variances (Welch's t-test) and the Mann-Whitney U test were performed on the salary observations. Furthermore, an effect size analysis was carried out to estimate the absolute salary difference and the percentage difference between the two data job categories, the corresponding Cohen's d, as well as their bootstrap confidence intervals.
$$\mathbf{H_0}: \mu_{DS} \leq \mu_{BI}$$
$$\mathbf{H_1}: \mu_{DS} > \mu_{BI}$$
$$\alpha = 0.05$$
# Hypothesis testing
stat_t, pvalue_t, stat_u, pvalue_u = compare_means(DS, BI, alternative = 'greater', alpha = 0.05)
T-test: The t statistic is 1.543; and the p-value is 0.079. Mann-Whitney U test: The U statistic is 119.500; and the p-value is 0.024. Inconsistent results between the tests.
The obtained p-value from the T-test was not significant ($p > 0.05$); whereas the obtained p-value from the Mann-Whitney U test was significant ($p < 0.05$).
In this context, taking into account that the Data Scientist and BI Analyst salary observations do not follow a normal distribution, the results from the Mann-Whitney U test can be deemed more reliable.
Therefore, all in all, the Data Scientist salaries are significantly higher than those for BI Analysts.
# Effect size analysis
effect_size(DS, BI, "Data Scientist", "BI Analyst", alpha = 0.05)
The mean salary difference and the percentage difference between the two data job categories is: $14,522 and 39.03%, respectively. The Cohen's d between the two data job categories is: 0.82. (Large Effect Size) Data Scientist's Salary --> 95.0% Confidence Interval: ($25,117, $69,961) BI Analyst's Salary --> 95.0% Confidence Interval: ($22,643, $38,651)
From the effect size analysis, it is possible to conclude that the mean salary difference between Data Scientist and BI Analyst positions is not only statistically significant (according to the Mann-Whitney U test) but also practically significant.
Certainly, a difference of $14,522 MXN per month and a percentage difference of about 39.03% is not negligible in the Mexican labor market.
To answer this question, the two-sample t-test with unequal variances (Welch's t-test) and the Mann-Whitney U test were performed on the salary observations. Furthermore, an effect size analysis was carried out to estimate the absolute salary difference and the percentage difference between the two data job categories, the corresponding Cohen's d, as well as their bootstrap confidence intervals.
$$\mathbf{H_0}: \mu_{DS} \leq \mu_{DA}$$
$$\mathbf{H_1}: \mu_{DS} > \mu_{DA}$$
$$\alpha = 0.05$$
# Hypothesis testing
stat_t, pvalue_t, stat_u, pvalue_u = compare_means(DS, DA, alternative = 'greater', alpha = 0.05)
T-test: The t statistic is 2.219; and the p-value is 0.029. Mann-Whitney U test: The U statistic is 411.500; and the p-value is 0.003. Test interpretation: Reject H0.
The obtained p-values from both tests are significant ($p < 0.05$). Therefore, the Data Scientist salaries are significantly higher than those for Data Analysts.
# Effect size analysis
effect_size(DS, DA, "Data Scientist", "Data Analyst", alpha = 0.05)
The mean salary difference and the percentage difference between the two data job categories is: $20,162 and 58.63%, respectively. The Cohen's d between the two data job categories is: 1.21. (Large Effect Size) Data Scientist's Salary --> 95.0% Confidence Interval: ($26,053, $69,375) Data Analyst's Salary --> 95.0% Confidence Interval: ($19,430, $29,902)
From the effect size analysis, it is possible to conclude that the mean salary difference between Data Scientist and Data Analyst positions is both statistically and practically significant.
Certainly, a difference of $20,162 MXN per month and a percentage difference of about 58.63% is quite large in the Mexican labor market.
To answer this question, the two-sample t-test with unequal variances (Welch's t-test) and the Mann-Whitney U test were performed on the salary observations. Furthermore, an effect size analysis was carried out to estimate the absolute salary difference and the percentage difference between the two data job categories, the corresponding Cohen's d, as well as their bootstrap confidence intervals.
$$\mathbf{H_0}: \mu_{BA} \leq \mu_{BI}$$
$$\mathbf{H_1}: \mu_{BA} > \mu_{BI}$$
$$\alpha = 0.05$$
# Hypothesis testing
stat_t, pvalue_t, stat_u, pvalue_u = compare_means(BA, BI, alternative = 'greater', alpha = 0.05)
T-test: The t statistic is 0.434; and the p-value is 0.333. Mann-Whitney U test: The U statistic is 252.000; and the p-value is 0.300. Test interpretation: Fail to reject H0.
The obtained p-values from both tests are not significant ($p > 0.05$). Therefore, the Business Analyst salaries are not significantly higher than those for BI Analysts.
# Effect size analysis
effect_size(BA, BI, "Business Analyst", "BI Analyst", alpha = 0.05)
The mean salary difference and the percentage difference between the two data job categories is: $1,947 and 6.30%, respectively. The Cohen's d between the two data job categories is: 0.13. (Small Effect Size) Business Analyst's Salary --> 95.0% Confidence Interval: ($23,594, $41,506) BI Analyst's Salary --> 95.0% Confidence Interval: ($22,560, $39,100)
From the effect size analysis, it is possible to conclude that the mean salary difference between Business Analyst and BI Analyst positions is not practically significant either.
Certainly, a difference of $1,947 MXN per month and a percentage difference of about 6.30% is negligible in the Mexican labor market.
To answer this question, the two-sample t-test with unequal variances (Welch's t-test) and the Mann-Whitney U test were performed on the salary observations. Furthermore, an effect size analysis was carried out to estimate the absolute salary difference and the percentage difference between the two data job categories, the corresponding Cohen's d, as well as their bootstrap confidence intervals.
$$\mathbf{H_0}: \mu_{BA} \leq \mu_{DA}$$
$$\mathbf{H_1}: \mu_{BA} > \mu_{DA}$$
$$\alpha = 0.05$$
# Hypothesis testing
stat_t, pvalue_t, stat_u, pvalue_u = compare_means(BA, DA, alternative = 'greater', alpha = 0.05)
T-test: The t statistic is 2.020; and the p-value is 0.025. Mann-Whitney U test: The U statistic is 999.000; and the p-value is 0.006. Test interpretation: Reject H0.
The obtained p-values from both tests are significant ($p < 0.05$). Therefore, the Business Analyst salaries are significantly higher than those for Data Analysts.
# Effect size analysis
effect_size(BA, DA, "Business Analyst", "Data Analyst", alpha = 0.05)
The mean salary difference and the percentage difference between the two data job categories is: $7,587 and 27.00%, respectively. The Cohen's d between the two data job categories is: 0.49. (Medium Effect Size) Business Analyst's Salary --> 95.0% Confidence Interval: ($22,987, $40,867) Data Analyst's Salary --> 95.0% Confidence Interval: ($19,582, $29,529)
From the effect size analysis, it is possible to conclude that the salary difference between Business Analyst and Data Analyst positions is not only statistically significant but also practically significant.
Certainly, a difference of $7,587 MXN per month and a percentage difference of about 27.00% is not negligible in the Mexican labor market.
To answer this question, the two-sample t-test with unequal variances (Welch's t-test) and the Mann-Whitney U test were performed on the salary observations. Furthermore, an effect size analysis was carried out to estimate the absolute salary difference and the percentage difference between the two data job categories, the corresponding Cohen's d, as well as their bootstrap confidence intervals.
$$\mathbf{H_0}: \mu_{BI} \leq \mu_{DA}$$
$$\mathbf{H_1}: \mu_{BI} > \mu_{DA}$$
$$\alpha = 0.05$$
# Hypothesis testing
stat_t, pvalue_t, stat_u, pvalue_u = compare_means(BI, DA, alternative = 'greater', alpha = 0.05)
T-test: The t statistic is 1.543; and the p-value is 0.066. Mann-Whitney U test: The U statistic is 869.000; and the p-value is 0.008. Inconsistent results between the tests.
The obtained p-value from the T-test was not significant ($p > 0.05$); whereas the obtained p-value from the Mann-Whitney U test was significant ($p < 0.05$).
In this context, taking into account that the BI Analyst and Data Analyst salary observations do not follow a normal distribution, the results from the Mann-Whitney U test can be deemed more reliable.
Therefore, all in all, the BI Analyst salaries are significantly higher than those for Data Analysts.
# Effect size analysis
effect_size(BI, DA, "BI Analyst", "Data Analyst", alpha = 0.05)
The mean salary difference and the percentage difference between the two data job categories is: $5,640 and 20.79%, respectively. The Cohen's d between the two data job categories is: 0.38. (Medium Effect Size) BI Analyst's Salary --> 95.0% Confidence Interval: ($22,463, $38,381) Data Analyst's Salary --> 95.0% Confidence Interval: ($19,475, $29,708)
From the effect size analysis, it is possible to conclude that, even though the salary difference between BI Analyst and Data Analyst positions is statistically significant, it is not practically significant, as a difference of $5,640 MXN per month and a percentage difference of about 20.79% is not very important in the Mexican labor market for technology.
Again, herein lies the importance of performing an effect size analysis alongside hypothesis testing.
The data job categories with the highest salaries in the Mexican labor market in November 2023, according to the OCC website, were Data Architect and ML Engineer. Indeed, the average salary for Data Architect positions was not significantly lower than that for ML Engineers. Thus, the present study's hypothesis is rejected. However, this result must be taken with caution, as only one salary observation could be retrieved for the latter position.
On the other hand, the most demanded data job category in the Mexican labor market was Data Analyst, even though it was also the one with the lowest salary. This category was also the most demanded across the different Mexican states, whereas ML Engineer and Data Architect positions were concentrated in certain locations, namely Ciudad de México, Nuevo León, and Jalisco.
Moreover, Ciudad de México was the location with the highest demand for data jobs. After the capital city, demand was highest in Estado de México and Jalisco, as well as for remote positions; similarly, the highest salaries can be found in those locations.
Furthermore, the companies with the greatest demand for data positions were Bairesdev, Banamex, Pepsico, and Softek; nevertheless, the organizations that offered the highest salaries were Ecosistemex, Caspex Corp, Addon Technologies, Enterprise Solutions, and Softek, which correspond to recruiting agencies and tech consulting firms.
Additionally, it was found that the demand for data jobs from some companies, such as Bairesdev, Pepsico, or Softek, spreads across several Mexican states, and that some well-known organizations' data job openings are located outside the capital region, such as those of Johnson Controls (Nuevo León).
The results of the present study suggest that Data Analyst, Business Analyst, and Data Engineer positions were in demand across a wide range of organizations. On the contrary, Data Architect and, certainly, ML Engineer vacancies could only be found in more specific organizations such as tech consulting firms or banks.
Finally, regarding the limitations of the present study, it is important to bear in mind that the data was collected solely from the OCC website and only for a very short period of time. Thus, very few observations were obtained for some of the least demanded data job categories: Data Scientist, Data Architect, and ML Engineer. Also, the collected data mostly corresponded to Ciudad de México, Nuevo León, Estado de México, and Jalisco, and no distinction was made among entry-level, middle, and senior positions. As future perspectives, it would therefore be advisable to gather data from more job websites, retrieve information over a longer time span, distinguish among entry-level, middle, and senior positions, and collect more salary data for Data Scientist, Data Architect, and ML Engineer positions as well as for other Mexican states.
# Compressing all figures in a ZIP file
!zip -r figures.zip Figures
(Output truncated: `figures.zip` updated with all PNG and SVG figures stored in the `Figures/` directory.)
# Code for composing the Python Requirements File
!pip freeze > requirements.txt
# End